TableSeer: Searching and Ranking PDF Table Data
Researchers at the Cyber-Infrastructure Lab in Penn State's College of Information Sciences and Technology have developed open source software called TableSeer that can find, extract, search, and rank table data from PDF files. The source code will be available at the project's close.
Here's an extract from the press release:
Tables are an important data resource for researchers. In a search of 10,000 documents from journals and conferences, the researchers found that more than 70 percent of papers in chemistry, biology and computer science included tables. Furthermore, most of those documents had multiple tables.
But while some software can identify and extract tables from text, existing software cannot search for tables across documents. That means scientists and scholars must manually browse documents in order to find tables, a time-consuming and cumbersome process.
TableSeer automates that process and captures data not only within the table but also in tables' titles and footnotes. In addition, it enables column-name-based search so that a user can search for a particular column in a table.
In tests with documents from the Royal Society of Chemistry, TableSeer correctly identified and retrieved 93.5 percent of tables created in text-based formats. . . .
Information on TableSeer can be found in a paper, "TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries," by Ying Liu, Kun Bai, Prasenjit Mitra, and C. Lee Giles of the Penn State College of Information Sciences and Technology.
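To make the idea of column-name-based search concrete, here is a minimal Python sketch. It is not TableSeer's code or its ranking method: the `TableRecord` class, the `search_by_column` function, the scoring rule, and the sample data are all hypothetical, chosen only to show how a table's column names, title, and footnotes might be stored and queried together.

```python
from dataclasses import dataclass, field


@dataclass
class TableRecord:
    """Hypothetical metadata record for one extracted table."""
    doc_id: str
    title: str
    columns: list[str]
    footnotes: list[str] = field(default_factory=list)


def search_by_column(tables, query):
    """Return tables that have a column name containing `query`, best match first."""
    q = query.strip().lower()
    hits = []
    for t in tables:
        names = [c.lower() for c in t.columns]
        if not any(q in name for name in names):
            continue  # no column name mentions the query term
        # Toy score (an assumption, not TableSeer's ranking): an exact
        # column-name match scores 2, a partial match 1; a mention in the
        # title or in any footnote adds 1 each.
        score = 2 if q in names else 1
        score += q in t.title.lower()
        score += any(q in f.lower() for f in t.footnotes)
        hits.append((score, t))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in hits]


# Made-up sample records for illustration only.
tables = [
    TableRecord("doc-1", "Melting points of selected salts",
                ["Compound", "Melting point (K)"]),
    TableRecord("doc-2", "Reaction yield by catalyst",
                ["Catalyst", "Yield"], ["Yield measured after 24 h."]),
    TableRecord("doc-3", "Solvent properties",
                ["Solvent", "Boiling point (K)"]),
    TableRecord("doc-4", "Summary of runs", ["Run", "Yield (%)"]),
]

results = search_by_column(tables, "yield")
```

Here a query for "yield" skips the tables that have no matching column and ranks the table whose title and footnote also mention the term ahead of the one that matches on a column name alone. A real system would work from metadata extracted from the PDFs rather than hand-written records.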