Blog — All Posts

Programming data for display: the PDF story

At the 2017 Papers We Love Conf, a scheduled speaker fell ill. With just 90 minutes to go before the vacant slot, the organizers asked me if I could fill in. I didn't have an appropriate talk prepared, but given my long history of working with PDF documents, I thought I could put together a reasonably entertaining presentation on the history, heritage, and design decisions behind the PDF file format and specification, one that would live up to the high standards of the Papers We Love community.

I was so relieved that the result was well-received!

How to tell if a PDF contains extractable text

If a PDF document appears to contain text when you view it on-screen, that text may actually be just an image of text, especially if the PDF was generated by scanning a paper document. Knowing whether a PDF is image-based is very important if you plan to extract text or structured data from it, or from other PDFs from the same source: image-based documents need OCR pre-processing to add a layer of real characters to the PDF before text or structured data recovery will succeed.

In this post, you'll learn how to tell in seconds whether a PDF document is image-based or contains extractable text.
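As a rough illustration of the idea (not necessarily the method the post describes), here is a minimal Python sketch that guesses whether a PDF is image-based by looking for font resources versus image objects in the file's bytes. The function names and the "text" / "image" / "unknown" labels are hypothetical, chosen only for this example. It assumes that a PDF with real text references a /Font resource somewhere, and it only looks inside Flate-compressed streams, so encrypted or unusually encoded files may come back as "unknown". A font reference also doesn't guarantee the text maps cleanly to Unicode, so treat the answer as a first hint, not a verdict.

```python
import re
import zlib

# Name tokens that hint at what a PDF contains (heuristic, not a parser).
FONT_RE = re.compile(rb"/Font\b")                 # font resource dictionary
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")    # image XObject
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def searchable_bytes(data: bytes) -> bytes:
    """Return the raw file bytes plus any streams that inflate as Flate data."""
    chunks = [data]
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate data (e.g. a JPEG image stream); skip it.
            pass
    return b"\n".join(chunks)


def classify_pdf_bytes(data: bytes) -> str:
    """Guess 'text', 'image', or 'unknown' from the tokens found in the file."""
    haystack = searchable_bytes(data)
    if FONT_RE.search(haystack):
        return "text"      # a font is referenced, so real characters are likely
    if IMAGE_RE.search(haystack):
        return "image"     # images but no fonts: probably a scan needing OCR
    return "unknown"
```

To try it on a file, read the bytes and pass them in, for example `classify_pdf_bytes(open("some.pdf", "rb").read())`, where `some.pdf` is a placeholder path.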

Why is getting data out of PDF documents so hard?

PDF documents are everywhere. Unfortunately, while they are useful and pleasant for people, programmatic extraction of data from PDFs is incredibly challenging. This post covers the history and structure of the PDF format, why it's so ill-suited to data interchange, and what needs to happen to make PDF data recovery a reliable part of modern programming and business processes.