Blog — All Posts

How to tell if a PDF contains extractable text

A PDF that appears to contain text on-screen may actually contain only an image of that text, especially if it was generated by scanning a paper document. Knowing whether a PDF is image-based matters if you plan to extract text or structured data from it, or from other PDFs from the same source: image-based documents need OCR pre-processing to add a layer of real characters to the PDF before text or structured data extraction will succeed.

In this post, you'll learn how to tell in seconds whether a PDF document is image-based or contains extractable text.
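
As a rough illustration of the idea (not necessarily the method the post describes), one quick programmatic check is to see whether the file references any font resources at all: a page that only holds a scanned image usually has none, while a page with real characters, including an OCR text layer, needs at least one. The sketch below uses only the Python standard library. The function name `references_fonts` is hypothetical, and the check is a heuristic under stated assumptions: it looks for the `/Font` key in the raw bytes and inside Flate-compressed streams, and it will miss encrypted files or streams that use other filters.

```python
import re
import zlib

# Matches the raw bytes between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def references_fonts(pdf_bytes: bytes) -> bool:
    """Heuristic: return True if the PDF data mentions a /Font resource.

    This is an illustrative sketch, not a full PDF parser. A False result
    suggests an image-only document that may need OCR; a True result only
    means fonts are referenced somewhere in the file.
    """
    # Uncompressed object dictionaries can be searched directly.
    if b"/Font" in pdf_bytes:
        return True

    # Newer PDFs may keep dictionaries inside compressed object streams,
    # so try to inflate each stream and search the result as well.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates trailing bytes after the zlib data.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example, a JPEG image stream)
        if b"/Font" in inflated:
            return True
    return False
```

To try it on a file, read the bytes with `open(path, "rb").read()` and pass them in. Treat the answer as a hint rather than a verdict: selecting text in a viewer or running a real text extractor remains the more reliable test.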

Why is getting data out of PDF documents so hard?

PDF documents are everywhere. Unfortunately, while they are convenient and pleasant for people to read, extracting data from them programmatically is incredibly challenging. This post covers the history and structure of the PDF format, why it's so ill-suited to data interchange, and what needs to happen to make PDF data recovery a reliable part of modern programming and business processes.