Sunday August 12, 01:55 PM
NRI scientist's search engine ranks tables by title, document content and text reference
By ANI
Washington, Aug 12 (ANI): An Indian-origin researcher at Penn State University has developed a search engine that not only identifies and extracts tables from PDF documents but also indexes and ranks the search results using factors that include the table's title, text references to the table and the date of its publication.
TableRank, the innovative ranking algorithm of the engine (TableSeer), is also capable of identifying tables found in frequently cited documents, said Prasenjit Mitra, assistant professor in the Penn State College of Information Sciences and Technology (IST).
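To give a rough sense of how a ranking built on such factors might work, the following is a minimal Python sketch. It is not the TableRank algorithm, whose details are not given here. The `TableRecord` fields, the `term_overlap` helper, the weights and the scaling constants are all assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TableRecord:
    # Hypothetical record layout; not taken from TableSeer.
    title: str
    reference_text: str  # text in the document that refers to the table
    year: int            # publication year of the containing document
    citations: int       # citation count of the containing document


def term_overlap(query: str, text: str) -> float:
    """Fraction of query terms that also appear in text (case-insensitive)."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def score(query: str, table: TableRecord, current_year: int = 2007,
          weights: tuple = (0.5, 0.2, 0.1, 0.2)) -> float:
    """Weighted sum of four factors, each scaled to [0, 1].

    The weights are illustrative assumptions, not published values.
    """
    w_title, w_ref, w_date, w_cite = weights
    age = max(current_year - table.year, 0)
    recency = 1.0 / (1.0 + age)  # newer documents score higher
    # Log-scaled citation count, capped at 1.0 (1000 is an arbitrary scale).
    cited = min(math.log1p(table.citations) / math.log1p(1000), 1.0)
    return (w_title * term_overlap(query, table.title)
            + w_ref * term_overlap(query, table.reference_text)
            + w_date * recency
            + w_cite * cited)


def rank(query: str, tables: list) -> list:
    """Return the tables sorted from highest to lowest score."""
    return sorted(tables, key=lambda t: score(query, t), reverse=True)
```

In this sketch a table whose title matches the query and whose document is recent and well cited rises to the top. A real system would need far more careful text matching and weighting.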
"TableSeer makes it easier for scientists and scholars to find and access the important information presented in tables, and as far as we know, is the first search engine for tables," said Prof. Mitra, an IIT Kharagpur alumnus, and one of the lead researchers in the development of the search engine.
In a search of 10,000 documents from conferences, Prof. Mitra and his team found that more than 70 percent of papers in chemistry, biology and computer science included tables. Furthermore, most of those documents had multiple tables.
But while some software could identify and extract tables from text, it could not search for tables across documents, which meant that scientists and scholars had to browse documents manually in order to find tables, a time-consuming and cumbersome process.
TableSeer, on the other hand, automates that process and captures data not only within the table but also in tables' titles and footnotes. In addition, it enables column-name-based search so that a user can search for a particular column in a table, said Prof. Mitra.
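Column-name-based search of this kind can be pictured as a lookup from column headers to the tables that contain them. The Python sketch below shows one simple way to do that with an inverted index. It is an assumption for illustration only and does not describe how TableSeer is built. The function names and the table identifiers are hypothetical.

```python
from collections import defaultdict


def build_column_index(tables: dict) -> dict:
    """Map each normalised column name to the set of table ids that use it.

    `tables` maps a table id to its list of column header strings.
    """
    index = defaultdict(set)
    for table_id, columns in tables.items():
        for name in columns:
            index[name.strip().lower()].add(table_id)
    return index


def search_by_column(index: dict, column_name: str) -> list:
    """Return the sorted ids of tables that have the given column."""
    return sorted(index.get(column_name.strip().lower(), set()))


# Hypothetical example data.
tables = {
    "t1": ["Compound", "Melting point"],
    "t2": ["compound ", "Yield"],
}
index = build_column_index(tables)
matches = search_by_column(index, "Compound")
```

Here the column names are normalised by trimming whitespace and lower-casing, so "Compound" and "compound " are treated as the same column.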
In tests with documents from the Royal Society of Chemistry, TableSeer correctly identified and retrieved 93.5 percent of tables created in text-based formats, he said.
Prof. Mitra said TableSeer can be tested online at http://chemxseer.ist.psu.edu, and the source code will be made available near the completion of the project.
For now, Prof. Mitra and his team are continuing research to improve the ranking algorithm by adding features. They are also working on a search engine that is capable of identifying, extracting and ranking figures found in documents.
The study, "TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries," was presented at the recent 16th International World Wide Web Conference in Alberta, Canada. (ANI)