Knowledge in PDFs

Thursday, May 08th, 2008 by hinrich


Most scientific journals now make electronic versions of their articles available in PDF format. This makes literature research much easier, since the textual content of PDFs can be extracted and indexed, but that is where the benefit stops. The problem I see is that PDF is a very nice format for giving people a document that looks exactly like the printed version (i.e. it has the same layout), but it is not easy to work further with those documents electronically (e.g. the text of the article is mixed with text from tables, figure legends, page numbers, page headers and footers, etc.). We need to be able to use computer technology (such as natural language processing and text mining) to help us produce knowledge from the individual scientific findings presented in different articles. If PDF documents could be structured so that the article text was readily available on its own, this could prove very useful.
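To make the mixing problem concrete, here is a minimal sketch of the kind of clean-up that extracted text needs before any text mining can start. It assumes the text has already been pulled out of the PDF by some other tool and split into one string per page. The function name `strip_page_furniture` and the `min_fraction` threshold are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def strip_page_furniture(pages, min_fraction=0.6):
    """Drop lines that look like page numbers or repeated headers/footers.

    `pages` is a list of strings, one per page of already-extracted text.
    A line is treated as a header or footer if it occurs on at least
    `min_fraction` of the pages (an arbitrary, assumed threshold).
    """
    # Count on how many pages each distinct non-empty line occurs.
    page_counts = Counter()
    for page in pages:
        page_counts.update(
            {line.strip() for line in page.splitlines() if line.strip()}
        )

    # Require at least two occurrences so single-page input is left alone.
    threshold = max(2, min_fraction * len(pages))
    repeated = {line for line, n in page_counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        for line in page.splitlines():
            text = line.strip()
            if not text or text in repeated:
                continue  # blank line or repeated header/footer
            if re.fullmatch(r"\d{1,4}", text):
                continue  # a bare page number
            cleaned.append(text)
    return "\n".join(cleaned)


pages = [
    "Journal of Examples\nFirst finding.\n1",
    "Journal of Examples\nSecond finding.\n2",
    "Journal of Examples\nThird finding.\n3",
]
body = strip_page_furniture(pages)
```

Even this simple heuristic only covers page numbers and running headers or footers. Table cells and figure legends have no such regular pattern, which is why the article text would ideally be available separately in the document itself.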

Posted in Science