How to tell if a PDF contains extractable text
(…or is rendered using a series of inaccessible images)
If a PDF document appears to contain text when you view it on-screen, that text may actually be just an image of text, especially if the PDF was generated by scanning a paper document. Knowing whether a PDF is image-based matters if you plan on extracting text or structured data from it, or from other PDFs from the same source: image-based documents need OCR pre-processing to add a layer of real characters to the PDF before text or structured data recovery will succeed.
In this post, you'll learn how to easily tell in seconds whether a PDF document is image-based, or contains extractable text.
As I discuss elsewhere, the "data model" provided by PDF documents is an impoverished one if your goal is to extract text or structured data from them. In short, every PDF is a sequence of instructions describing how to render each page to a (relatively low-level) graphics context. In order to produce a page that looks like a document you'd recognize (an invoice, a tax form, a resumé), PDF offers only three options for rendering text:
- Painting text, one character at a time, using either a native font and a standard text encoding, or an embedded font that carries a valid mapping between it and a standard text encoding.
- Painting "text", one character at a time, using an embedded image-based font, where no mapping is provided between that font's organization and a standard text encoding like ISO-8859-1 or similar.
- Painting a single image for each page, each containing pixels that look like text.
Let's run through how to tell when each method has been used in rendering a PDF document. The only tools you'll need are:
- Any PDF reader; I'll be demonstrating using Acrobat Reader, but Apple's Preview, Evince on Linux, or any similar app will do.
- Any text editor.
How to identify a PDF that contains text ready for extraction
- Open the sample PDF in your PDF reader.
- Attempt to select some text; if the range of characters you choose is highlighted as you'd expect, great! If not, you're probably looking at an image-based PDF.
- Once you've selected some text, copy it to your clipboard (Ctrl-C or Cmd-C, depending on your operating system).
- Now switch to your text editor, and paste the text the PDF reader extracted from the source document (Ctrl-V or Cmd-V). If you see the same characters that you selected, great! You have a PDF that contains extractable text.
On the other hand, if you see different characters, such as unusual glyphs or nonsense words, then you have a PDF that contains "text" unsuitable for extraction.
In summary, if you can select and then copy-and-paste text from a PDF document and the pasted result contains the same characters as what you selected, then you have a PDF ready for text and structured data extraction.
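If you have many documents to check, the copy-and-paste test can be roughly automated. The following is only a sketch: it assumes the third-party pypdf package is installed, and the helper names extract_page_texts and preview are made up for this illustration. It prints nothing by itself; it just gives you a short snippet per page to compare against what you see on-screen.

```python
def extract_page_texts(pdf_path):
    """Return one string of extracted text per page.

    Assumption: the third-party pypdf package is installed (pip install pypdf).
    """
    from pypdf import PdfReader  # imported lazily so preview() works without pypdf

    reader = PdfReader(pdf_path)
    # extract_text() can yield an empty result for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def preview(page_texts, limit=200):
    """Collapse runs of whitespace and keep the first `limit` characters per page."""
    return [" ".join(text.split())[:limit] for text in page_texts]
```

Calling preview(extract_page_texts("sample.pdf")) on a file of your own (the filename here is a placeholder) returns one snippet per page. Readable snippets that match the page suggest extractable text; empty or nonsensical snippets point to one of the two problem cases described in this post.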
Identifying a PDF that contains "text" unsuitable for extraction / recovery
Following the same procedure as above, if the text that you paste into your text editor is "junk" — nonsense words, strange glyphs, or perhaps even whitespace — then your PDF document does not contain extractable text.
In this case, text is being rendered using an image-based font embedded in the document. Such fonts often have custom encodings that aren't included in the PDF, and so the character codes used to render text cannot be used to extract that text.
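One way to flag this kind of "junk" programmatically is to count characters that rarely occur in genuine document text. The sketch below is a heuristic under stated assumptions, not a definitive test: the function names and the 0.2 threshold are arbitrary choices for illustration, and it only catches junk made of control, private-use, unassigned, or replacement characters. Nonsense made entirely of ordinary letters would need a different check, such as a dictionary lookup.

```python
import unicodedata

# Unicode general categories that rarely appear in genuine document text:
# Cc = control, Co = private use, Cn = unassigned, Cs = surrogate.
SUSPECT_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}


def suspect_ratio(text):
    """Fraction of non-whitespace characters that look like encoding junk."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0  # empty or whitespace-only text is handled as "no text", not junk
    suspect = sum(
        1
        for ch in chars
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPECT_CATEGORIES
    )
    return suspect / len(chars)


def looks_garbled(text, threshold=0.2):
    """True if more than `threshold` of the visible characters are suspect."""
    return suspect_ratio(text) > threshold
```

Note that a whitespace-only result is reported as not garbled here, so it should be checked for separately as a sign that no usable text was extracted at all.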
Unfortunately, this case is equivalent to fully image-based PDF documents: an OCR preprocessing step will be needed before any text or structured data extraction will be possible.
How to identify an image-based PDF document
Following the same procedure as above, if you cannot even select text in a source PDF, then the document is almost surely image-based. This is extremely common for scanned PDF documents, where PDF is just being used as a container for a series of page scans.
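The "cannot select any text" check can be approximated in code by looking for pages that yield no text at all. This sketch assumes you already have one extracted string per page (for example from a PDF library's text extraction call); the function names, the labels, and the min_chars cutoff are illustrative choices, not part of any standard.

```python
def pages_without_text(page_texts, min_chars=10):
    """Return 1-based numbers of pages with fewer than min_chars visible characters."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len("".join(text.split())) < min_chars
    ]


def classify_document(page_texts, min_chars=10):
    """Label a document 'image-based', 'mixed', or 'has-text'."""
    if not page_texts:
        raise ValueError("no pages to classify")
    empty = pages_without_text(page_texts, min_chars)
    if len(empty) == len(page_texts):
        return "image-based"  # no page produced any text: likely a scan
    if empty:
        return "mixed"  # some pages are probably scans inserted among text pages
    return "has-text"  # text is present, but it may still be junk
```

A "has-text" result only says that characters came out of every page; whether those characters are the right ones still has to be checked by eye or with a separate heuristic.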
Scanned PDF documents need an OCR preprocessing step before any text or structured data extraction will be possible.
Options for recovering text or structured data from image-based PDFs
Being able to reliably extract text content or structured data from image-based PDFs (whether the result of scanning paper documents, or PDFs that use image-based fonts as discussed above) requires first applying an OCR (Optical Character Recognition) process to those PDFs. An OCR pass will add a layer of real text on top of the original page images, which a tool or service like PDFDATA.io can then use to extract bodies of text, or use as the basis for structured data recovery.
(The output of an OCR process over PDF that produces a text-enriched PDF is sometimes called a "searchable PDF", since it can be readily indexed and found by e.g. a CMS or search engine.)
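As one possible illustration of such an OCR pass, the sketch below drives the open-source OCRmyPDF command-line tool from Python. The choice of tool is an assumption made purely for this example, not a recommendation from this post, and the helper names are hypothetical. By default OCRmyPDF declines to process pages that already carry text, so for PDFs whose existing "text" is junk the sketch passes its --force-ocr option, which rasterizes each page and OCRs it from scratch.

```python
import shutil
import subprocess


def build_ocr_command(src, dst, text_is_garbled=False):
    """Build an ocrmypdf command line (OCRmyPDF is an assumed tool choice)."""
    command = ["ocrmypdf"]
    if text_is_garbled:
        # Discard the unusable text layer: rasterize every page and OCR it.
        command.append("--force-ocr")
    command += [str(src), str(dst)]
    return command


def run_ocr(src, dst, text_is_garbled=False):
    """Run the OCR pass, failing loudly if the tool is not available."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocr_command(src, dst, text_is_garbled), check=True)
```

The resulting file is the text-enriched PDF described above; it is worth repeating the copy-and-paste check on it to confirm the recognized text is accurate enough for your purposes.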
We don't yet offer an integrated OCR step via PDFDATA.io, but if you are planning on using our services to drive your text and structured data extraction, ask us which OCR tool or vendor would be most appropriate for your situation. There are many different OCR tools and vendors, each with different tradeoffs when it comes to the types of documents for which they are best suited.