Introduction

PDFDATA.io is an HTTP API that provides PDF data extraction as a service. Our goal is to make it easy for you to access the data you care about that is held within PDF documents, in a way that can be readily integrated into your applications and services. More concretely, working with PDFDATA.io consists of:

  1. Providing access to your PDF source documents
  2. Describing which data from those documents should be extracted
  3. Retrieving the extracted data

This is accomplished by interacting with the PDFDATA.io HTTP API, which incorporates many of the best principles of REST design.

This design makes it straightforward to interact with the API using any language's or tool's modern HTTP client. For example, sample interactions are shown throughout this documentation using curl; analogous usage can be readily constructed using similar tools, or in any language using its HTTP client library.

Setup

The examples in this documentation are written using the curl command-line HTTP client. curl is probably available via your operating system's package manager, or direct download.

While you can use curl itself (as part of a shell or cmd.exe script), these examples are provided in part because curl invocations are widely understood as representations of HTTP requests. If we do not yet provide a client library for your programming language, you should be able to use these example curl interactions to guide your integration with the PDFDATA.io API using your language's HTTP client implementation.

Install via npm:

npm install pdfdata

Source for pdfdata-node is available via GitHub.

pdfdata-node is our Node.js client for PDFDATA.io. It is an idiomatic, promise-based JavaScript library that any Node developer can have up and running in less than a minute. Add it to your project's package.json file, or run npm install pdfdata to start working with it.

Authentication & Credentials

Throughout this documentation, sample code will use a dummy test API key:

test_xkH4xlrO7K80J5CTwEjBeSo6

If you log in or register, your test API key will be shown in all code samples instead.

Authenticating your requests to the PDFDATA.io API requires providing your credentials, an API key associated with your account. You can manage your API keys in the dashboard. API keys enable access to your source documents and extracted data, so be sure to keep them secret! Do not leave your API keys in source control repositories, client-side code, or other widely accessible areas. We will honor any API request that includes your credentials as being authorized by you, so protect them accordingly.

There are two classes of API keys that can be associated with a PDFDATA.io account, test and live. Which one you use will control what limits will be applied to your usage of the API, and what charges (if any) will result from that usage. Each class of API key is easily identified by its prefix (either test_ or live_).
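As a small illustration of the prefix convention, the sketch below classifies a key by its prefix. The function name apiKeyClass is hypothetical and not part of pdfdata-node; only the test_ and live_ prefixes come from this documentation.

```javascript
// Hypothetical helper (not part of pdfdata-node): classify an API key
// by the documented test_ / live_ prefixes.
function apiKeyClass(key) {
  if (key.startsWith("test_")) return "test";
  if (key.startsWith("live_")) return "live";
  return null; // not a recognizable key prefix
}

console.log(apiKeyClass("test_xkH4xlrO7K80J5CTwEjBeSo6")); // prints: test
```

A check like this can be useful at application startup, for example to refuse to run a production deployment with a test key.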

Providing your API credentials

curl https://api.pdfdata.io/v1/ \
  -u test_xkH4xlrO7K80J5CTwEjBeSo6:

Requests to the PDFDATA.io API must carry credentials, and must be made via HTTPS; any that do not will result in an error.

API requests are authenticated by HTTP Basic Auth. Your API key is the HTTP Basic username; the password is empty / not provided.
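If your HTTP client does not build Basic Auth headers for you, the header value can be constructed by hand. The following is a minimal sketch of standard HTTP Basic Auth encoding with an empty password; the function name basicAuthHeader is an assumption made for illustration.

```javascript
// Hypothetical helper: build an HTTP Basic Authorization header value
// where the API key is the username and the password is empty.
function basicAuthHeader(apiKey) {
  // Basic Auth encodes "username:password"; with an empty password
  // the credential string is simply the key followed by a colon.
  const token = Buffer.from(apiKey + ":", "utf8").toString("base64");
  return "Basic " + token;
}

const header = basicAuthHeader("test_xkH4xlrO7K80J5CTwEjBeSo6");
console.log(header.startsWith("Basic ")); // prints: true
```

The resulting string would be sent as the value of the Authorization request header, which is what curl's -u option does on your behalf.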

var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");

The pdfdata node module exports a single function that requires your API key to produce a usable client object. It's not possible to use pdfdata-node without providing your API key, and all communications between it and PDFDATA.io are secured by HTTPS / TLS.

Quick Start

Extracting metadata from a single PDF document

curl https://api.pdfdata.io/v1/procs \
  -u test_xkH4xlrO7K80J5CTwEjBeSo6: \
  -F file=@test.pdf \
  -F operations='[{"op":"metadata"}]'

var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.procs.configure()
  .withFiles(["test.pdf"])
  .operation({op:"metadata"})
  .start()
  .then(console.log)

Example extracted metadata response

{
  "type": "proc",
  "id": "proc_1555579d8df",
  "created": "2016-06-15T19:11:36Z",
  "source_tags": [],
  "operations": [{
    "op": "metadata"
  }],
  "status": "complete",
  "documents": [{
    "type": "doc",
    "id": "doc_8e96ec0533ac3e1e988b7d1ca27bfdc096b82ddc",
    "filename": "test.pdf",
    "tags": [],
    "created": "2016-06-15T19:11:36Z",
    "expires": "2016-07-15T19:15:32Z",
    "results": [{
      "op": "metadata",
      "data": {
        "Creator": "SPDF",
        "Title": "Microarray Gene Expression Data with Linked Survival Phenotypes",
        "Producer": "AppendPro 3.0 Linux 7 SPDF_1085 May 15 2003",
        "Subject": "Center for Bioinformatics & Molecular Biostatistics",
        "ModDate": "2006-01-24T20:38:13Z",
        "Keywords": "Diffuse large-B-cell lymphoma; Gene harvesting; Least angle regression",
        "CreationDate": "2006-01-24T20:38:13Z",
        "Author": "Mark R. Segal"
      }
    }]
  }]
}

Once your environment is set up, you can get results from PDFDATA.io very quickly and with just a couple lines of code.

Nearly every PDF document contains metadata, and PDFDATA.io provides an operation that extracts it. Applying that operation to a sample PDF document you have available is the "Hello World" of PDFDATA.io integration and usage.

The code snippet shown here does this in a single request by:

  1. uploading a source document named test.pdf located in the current directory
  2. configuring a single data extraction operation
  3. creating a proc, which will apply that operation to the source document

Because the metadata extraction operation is very lightweight and requires relatively little processing, a proc consisting only of applying it to a single document will almost always complete before the default timeout for the request's response expires, so that response will almost always contain the completed extraction results.

Once you've completed this "Hello World" sort of task, you'll want to learn about which operations you'll need to get the data you want from your source PDFs, how to work with documents separately from procs, and how procs are long-running processes, which carries significant implications for how you'll interact and integrate with the PDFDATA.io API in a production environment.

Security and Privacy

The PDFDATA.io API is built to provide exceptional data extraction quality and features not readily available elsewhere, packaged so as to be easily usable in any modern programming language and application context. We can do this only because the API is primarily a managed, hosted service provided by us.

This means that, when you use PDFDATA.io, your source PDF documents and the data it extracts from them are necessarily held by us for a time. We take this responsibility very seriously, and thus apply the same set of security standards, engineering practices, and policies to every PDFDATA.io user, regardless of service level:

  1. Your data is yours. It is never sold to or shared with outside organizations, except in the case of infrastructure vendors we use to help us provide PDFDATA.io.
  2. Data you provide to us (either uploaded PDF documents, or the data we extract from them for you) is always transmitted securely (encrypted via HTTPS).
  3. Further, all data you provide to us automatically expires, i.e. is purged from our systems. This expiration defaults to 30 days for uploaded PDF documents, and 90 days for extracted data.

Usage limits

Your usage of the API is governed by whether or not you have activated a PDFDATA.io service plan, and which class of API key you use to authenticate requests to the API.

PDFDATA.io API usage is moderated only in terms of the number of PDF documents from which you can extract data within a given month. Please consult the descriptions of available plans for details, or your dashboard to see exactly what limits your account's API keys have. What follows here is general guidance only.

When you register on PDFDATA.io (which is free), you will only have access to a test API key in your account at first, which will have very stringent source document limits. The intent of this is to allow new visitors to PDFDATA.io to familiarize themselves with how the service and API work.

After you activate a PDFDATA.io service plan, your account is provisioned a live API key (with limits set according to your chosen plan). In addition, your test API key's limits are extended to roughly 5% of your plan's full allotment of PDF source documents per month. For example, if your chosen plan includes 5,000 documents / month, usage authenticated by your live API key will draw down that allotment, and additional charges may occur if usage exceeds that allotment. Meanwhile, usage authenticated by your test API key will be limited to ~250 documents / month.

The intent of this is to allow you and your team to have a way to go about development and testing of your applications and systems with a dedicated PDFDATA.io API key that doesn't impinge upon what your service plan enables for you in production environments.
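The arithmetic behind the example above can be sketched as follows. The 5% figure is described here as approximate, so treat the result as an estimate only; your dashboard shows the actual limits. The function name is hypothetical.

```javascript
// Hypothetical helper: estimate a test key's monthly document limit
// as roughly 5% (one twentieth) of the plan's monthly allotment.
function estimatedTestKeyLimit(planDocsPerMonth) {
  return Math.round(planDocsPerMonth / 20);
}

console.log(estimatedTestKeyLimit(5000)); // prints: 250
```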

Errors

PDFDATA.io's HTTP API uses standard HTTP status codes and explanatory JSON bodies to indicate API request failures. All API-level errors can be broadly separated into problems with the request (these will provoke responses with 4xx status codes), or unexpected problems on our part in processing the request (resulting in a 5xx status code).
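A client can use this split to decide how to react to a failure. The sketch below is one possible classification by status code; the category names and the idea that 5xx responses are the ones worth retrying are assumptions for illustration, not behavior specified by the API.

```javascript
// Hypothetical helper: bucket an HTTP status code into the two broad
// error categories described above.
function classifyStatus(status) {
  if (status >= 400 && status < 500) return "request-error"; // problem with the request
  if (status >= 500 && status < 600) return "server-error";  // problem on the service side
  return "not-an-error";
}

console.log(classifyStatus(404)); // prints: request-error
console.log(classifyStatus(503)); // prints: server-error
```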

Core Concepts & Types

As outlined in the introduction, working with PDFDATA.io consists of:

  1. Providing access to your PDF source documents
  2. Describing which data from those documents should be extracted via declarative representations of the corresponding operations that collectively form a proc
  3. Retrieving the data extracted by the proc; depending on the selected operations, that data might be structured JSON, while other data might be binary resources

These bolded terms are the four fundamental concepts represented in PDFDATA.io's API, as well as the four primary "types" of objects you'll encounter when working with the API; an in-depth discussion of them dominates the remainder of this guide. There are a couple of additional data shapes worth discussing here:

Timestamps

The only atomic values understood and produced by PDFDATA.io, aside from those defined by JSON itself, are timestamps. Timestamps are encoded in an ISO 8601 string format that corresponds to the standard JavaScript Date format:

YYYY-MM-DDTHH:mm:ssZ

Timestamps will always include all components of this string formatting (e.g. timestamps will never omit the time or seconds components), and will always be in UTC.
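To make the format concrete, here is a minimal sketch that checks whether a string matches the timestamp shape above and that converts a JavaScript Date into it. Both function names are hypothetical; note that Date.prototype.toISOString() includes milliseconds, which this sketch strips to match the format shown.

```javascript
// Matches exactly YYYY-MM-DDTHH:mm:ssZ (UTC, seconds precision).
const TIMESTAMP_PATTERN = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$/;

// Hypothetical helper: true if value has the documented timestamp shape
// and also parses as a real date.
function isTimestamp(value) {
  return typeof value === "string" &&
    TIMESTAMP_PATTERN.test(value) &&
    !Number.isNaN(Date.parse(value));
}

// Hypothetical helper: format a Date in the documented shape.
function toTimestamp(date) {
  // toISOString() yields e.g. "2016-06-15T19:11:36.000Z"; drop the milliseconds.
  return date.toISOString().replace(/\.\d{3}Z$/, "Z");
}

console.log(isTimestamp("2016-06-15T19:11:36Z")); // prints: true
console.log(isTimestamp("2016-06-15"));           // prints: false
```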

Documents

PDF documents are the raw material from which PDFDATA.io extracts data. You need to get your PDF documents into PDFDATA.io in order for it to be able to read them; this is done by uploading your documents via the API, which creates document objects that carry some limited metadata about them. As a shortcut, you can also upload documents, create a proc, and receive the extracted data all in a single request.

Once PDFDATA.io has access to your source documents, you can refer to them using their unique identifiers or associated tags when creating data extraction procs.

Example Document object

{
  "type": "doc",
  "id": "doc_8e96ec0533ac3e1e988b7d1ca27bfdc096b82ddc",
  "filename": "annual_report.pdf",
  "tags": ["acquired:2017-05-28"],
  "created": "2017-05-28T12:37:53Z",
  "expires": "2017-06-27T12:37:53Z"
}
Document object attributes

type

The type of this object, always "doc".

id

Once in PDFDATA.io, each document is assigned a unique identifier, a cryptographic hash of its contents prefixed with the object's type (doc_). Thus, the same document provided to PDFDATA.io multiple times will always be assigned the same identifier.

filename

The filename for this document within PDFDATA.io. This will be the same as the filename of the document from when you uploaded it, or when PDFDATA.io obtained it from an integrated storage service.

tags

A collection of string tags associated with this document. These tags can be added when documents are provided to PDFDATA.io, and used later to start procs over groups of documents that share a given tag.

created

The time when PDFDATA.io first acquired this document.

expires

The time when the document itself will expire out of storage. This is calculated to be 30 days from when this document was most recently provided to PDFDATA.io. After that point, the document will not be available for inclusion in new procs until it is provided again.

All operations over documents are mediated through the /v1/documents resource and its descendants.

Document lifecycle

Every document carries an expiration date, which is initially set to 30 days from when it is uploaded. This expiration window is reset each time a proc is started that sources data from a document; in this way, documents that are being actively touched by proc activity at least once every 30 days will remain in the system in perpetuity.

When a document expires, the PDF itself is purged from PDFDATA.io's systems, but the document record itself remains for reporting and billing purposes.
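The 30-day window described above can be sketched as simple date arithmetic. This is only an illustration of the rule as documented; the function name is hypothetical, and the authoritative value is always the expires attribute returned by the API.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: given the time a document was last uploaded or
// included in a proc, estimate when it will expire (30 days later).
function expiryAfter(lastUsed, days = 30) {
  const expires = new Date(new Date(lastUsed).getTime() + days * DAY_MS);
  // Match the documented timestamp format (no milliseconds).
  return expires.toISOString().replace(/\.\d{3}Z$/, "Z");
}

console.log(expiryAfter("2017-05-28T12:37:53Z")); // prints: 2017-06-27T12:37:53Z
```

The printed value matches the created and expires attributes of the example document object shown earlier in this section.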

Uploading documents

Definition

POST https://api.pdfdata.io/v1/documents
pdfdata.documents.upload({ARRAY_OF_PDF_PATHS}, [{ARRAY_OF_TAGS}]);

Example request

curl https://api.pdfdata.io/v1/documents \
  -u test_xkH4xlrO7K80J5CTwEjBeSo6: \
  -F file=@{PATH_TO_PDF} \
  -F file=@{PATH_TO_PDF2} \
  -F tag={TAG_STRING}

var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.documents.upload(["{PATH_TO_PDF}", "{PATH_TO_PDF2}"],
                         ["{TAG_STRING}"])
    .then(function (response) {
        // handle response
    })
    .catch(function (error) {
        // handle failure
    });

Documents are uploaded by sending a multipart/form-data POST request, which can contain one or many documents along with their filenames.

Request parameters

file

Each file parameter's value should consist of the raw binary content of a source PDF document. The filename document attribute is sourced from the filename property of each file request parameter. Any provided type attribute is ignored; every range of file data is assumed to be a PDF document.

tag optional

Each tag parameter is added to the uploaded documents' set of tags. Tags can be used to easily refer to groups of source PDF documents when creating new procs. Tags may not contain whitespace, or be the empty string.

Parameters (pdfdata-node)

documents

An Array of string paths to the source PDF documents to upload. The filename document attribute is sourced from the filename of the named local file. The content of each file is not checked at upload-time; every file is assumed to be a PDF document.

tags optional

Each string tag is added to the uploaded documents' set of tags. Tags can be used to easily refer to groups of source PDF documents when creating new procs. Tags may not contain whitespace, or be the empty string.
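The tag constraints above (no whitespace, not the empty string) are easy to check on the client before uploading. The following is a minimal sketch; isValidTag is a hypothetical name, and the service may apply further rules that are not documented here.

```javascript
// Hypothetical helper: check the two documented tag constraints.
function isValidTag(tag) {
  return typeof tag === "string" && tag.length > 0 && !/\s/.test(tag);
}

console.log(isValidTag("acquired:2016-06-11")); // prints: true
console.log(isValidTag("january resumes"));     // prints: false
console.log(isValidTag(""));                    // prints: false
```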

The document upload response

Example response

[{
  "type": "doc",
  "filename": "document.pdf",
  "created": "2016-06-11T18:23:33Z",
  "expires": "2016-07-11T18:23:33Z",
  "tags": ["acquired:2016-06-11"],
  "id": "doc_8e96ec0033ac3e1e988b7d1ca27bfdc096b82ddc"
}, {
  "type": "doc",
  "filename": "document2.pdf",
  "created": "2016-06-11T18:23:33Z",
  "expires": "2016-07-11T18:23:33Z",
  "tags": ["acquired:2016-06-11"],
  "id": "doc_a5d8e5d1b99ac891226acb35f24a9f8f8eda50df"
}]

When successful, document uploads will yield a 201 Created HTTP response, the body of which will be a JSON-encoded array of the document objects corresponding to the uploaded PDF files.

With pdfdata-node, successful document uploads yield a response consisting of an Array of document objects corresponding to the uploaded PDF files.

Document objects will always be given at least one tag indicating the date of their initial creation (i.e. the first time the corresponding PDF has been seen by PDFDATA.io, like "acquired:2016-06-11"), in addition to any tags explicitly specified in the upload request.

If a document has been provided to PDFDATA.io previously, then uploading it again will do up to three things:

  1. The corresponding document object's expires attribute will be "reset" to 30 days from the time of the most recent upload.
  2. At least one tag will be added to the document object, as described above.
  3. If the document had previously expired and the source PDF file had therefore been removed from storage, uploading again restores it and makes it available for inclusion in new procs.

Retrieving documents

Definition

GET https://api.pdfdata.io/v1/documents/{DOCUMENT_ID}
pdfdata.documents.get("{DOCUMENT_ID}");

Example request

curl https://api.pdfdata.io/v1/documents/{DOCUMENT_ID} \
  -u test_xkH4xlrO7K80J5CTwEjBeSo6:

var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.documents.get("{DOCUMENT_ID}")
    .then(function (response) {
        // handle response
    })
    .catch(function (error) {
        // handle failure
    });

Example response

{
  "type": "doc",
  "filename": "document.pdf",
  "created": "2016-06-11T18:23:33Z",
  "expires": "2016-07-11T18:23:33Z",
  "tags": ["acquired:2016-06-11"],
  "id": "doc_8e96ec0033ac3e1e988b7d1ca27bfdc096b82ddc"
}

A single document object can be retrieved, given its id. (Note well that this returns only the document object, not the previously-uploaded source PDF document.) If the id is valid, the response will be the JSON representation of the single document object.

Listing Documents

Definition

GET https://api.pdfdata.io/v1/documents/
pdfdata.documents.list([options]);

Example request

curl https://api.pdfdata.io/v1/documents/ \
  -u test_xkH4xlrO7K80J5CTwEjBeSo6:

var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.documents.list()
    .then(function (response) {
        // handle response
    })
    .catch(function (error) {
        // handle failure
    });

Multiple document objects can be retrieved by "listing" the top-level /v1/documents resource.

With pdfdata-node, multiple document objects can be retrieved by "listing" the documents object.

Example response

[{
  "type": "doc",
  "filename": "document.pdf",
  "created": "2016-06-11T18:23:33Z",
  "expires": "2016-07-11T18:23:33Z",
  "tags": ["acquired:2016-06-11"],
  "id": "doc_8e96ec0033ac3e1e988b7d1ca27bfdc096b82ddc"
}, {...},
   {...}]

Document listing responses consist of a JSON-encoded array of document objects, in descending order of creation date (i.e. most-recently-acquired documents first). Note that since a document object's created attribute is immutable and is set when a document is first uploaded, its position within the listing will never change, even if its corresponding source PDF document is uploaded multiple times.

Example request

curl -G https://api.pdfdata.io/v1/documents/ \
  -d before=2016-06-11T18:23:33Z \
  -u test_xkH4xlrO7K80J5CTwEjBeSo6:

var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.documents.list({before:"2016-06-11T18:23:33Z"})
    .then(function (response) {
        // handle response
    })
    .catch(function (error) {
        // handle failure
    });

Example response

[{
  "type": "doc",
  "filename": "report-1819.pdf",
  "created": "2016-06-10T06:13:18Z",
  "expires": "2016-07-10T06:13:18Z",
  "tags": ["acquired:2016-06-10"],
  "id": "doc_4a60c27d4653a359478f410be8c45617c59870db"
}, {...},
   {...}]

The number of document objects returned in each listing response is capped. To retrieve the next "page" of documents, another listing request must be made, with a before parameter equal to the created attribute value of the last document in the previous listing response.

Request parameters / Options

before optional

An ISO 8601-formatted date/timestamp. Valid precisions are a full date + time of day UTC (e.g. 2016-06-11T18:23:33Z), or a simple date (e.g. 2016-07-13, which implies midnight UTC for the time component). The "page" of document objects returned in response will begin with the document bearing the latest created attribute prior (but not equal) to any provided before parameter value. With pdfdata-node, before can alternatively be a JavaScript Date object; in this case, the full date + time of day are utilized.

When omitted, the page of documents provided in response will begin with the most recently-created document object.
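The paging scheme described above can be sketched as a loop that keeps passing the created value of the last document seen as the next before value. The helper below takes the listing function as an argument so that it does not depend on any particular client; listAllDocuments is a hypothetical name, and treating an empty page as the end of the listing is an assumption, since the exact end-of-listing behavior is not described here.

```javascript
// Hypothetical helper: collect every document by following "before" paging.
// listFn must accept an options object and return a promise of an array of
// document objects, as pdfdata.documents.list is documented to do.
async function listAllDocuments(listFn) {
  const all = [];
  let options = {};
  for (;;) {
    const page = await listFn(options);
    if (page.length === 0) break; // assumption: an empty page means no more documents
    all.push(...page);
    // Next page starts strictly before the last document seen.
    options = { before: page[page.length - 1].created };
  }
  return all;
}
```

With pdfdata-node this might be called as listAllDocuments(function (options) { return pdfdata.documents.list(options); }). Because before is exclusive, documents that share an identical created timestamp across a page boundary could be skipped by a loop like this one.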

Procs

A proc (short for "process") is the combination of a set of source PDF documents and a set of descriptions of data extraction operations to be applied to those documents. Once a proc is complete, the results are available as some combination of structured JSON data and binary resources, the structure of which depends on which operations were selected to be part of the proc.

PDFDATA.io offers many operations, and more are being added all the time. Detailed documentation on each of the available operations can be found here; this discussion of procs and the documentation of the proc-related portions of the API will make use of just two operations, metadata and images, both of which appear in the example proc object in this section.

Procs are processes

Example (complete) Proc object

{
  "type": "proc",
  "id": "proc_1554e77a1ee",
  "created": "2017-05-28T12:37:53Z",
  "source_tags": ["acquired:2017-05-28", "january-resumes"],
  "operations": [{
    "op": "metadata"
  }, {
    "op": "images"
  }],
  "status": "complete",
  "documents": [{
    "type": "doc",
    "id": "doc_8e9600cd7db5baf2fad83e4d8b48359678b24322",
    "filename": "8e9600cd7db5baf2fad83e4d8b48359678b24322.pdf",
    "tags": ["acquired:2016-07-01"],
    "created": "2016-07-01T18:46:21Z",
    "expires": "2016-08-01T18:46:21Z",
    "results": [{
      "op": "metadata",
      "data": {
        "Title": "C:\\user\\workspace\\test.txt",
        "ModDate": "2009-11-11T07:27:58Z",
        "Producer": "Acrobat Web Capture 9.0",
        "CreationDate": "2009-11-11T07:12:00Z"
      }
    }, {
      "op": "images",
      "data": [{
        "type": "page",
        "images": [{
          "type": "img",
          "bounds": [62.362, 248.541, 72.362, 255.541],
          "resource": "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a"
        }, {
          "type": "img",
          "bounds": [62.362, 173.397, 63.362, 174.397],
          "resource": "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72"
        }],
        "dimensions": [595, 839],
        "pagenum": 0
      }],
      "resources": {
        "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a": {
          "id": "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a",
          "format": "png",
          "mimetype": "image/png",
          "dimensions": [10.0, 7.0]
        },
        "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72": {
          "id": "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72",
          "format": "png",
          "mimetype": "image/png",
          "dimensions": [1.0, 1.0]
        }
      }
    }]
  }]
}

Procs are long-running, inherently asynchronous things. When you create a proc through the PDFDATA.io API, our services will get to work on it right away, but it may take some time to complete before you can receive the resulting extracted data: sometimes seconds, but often many minutes, depending on the "size" of the proc (how many source documents and how many data extraction operations are included).

As described below, the proc API will return data extraction results as part of the initial response to a request to create a new proc when possible. This is nice when it happens, but the potentially long-running nature of procs means that much of the time (and probably always in a production context), you should expect to make at least two API calls for each proc: one to configure and create a proc, and one (or more) to check the status of the proc and retrieve its extracted data.
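One way to handle the second step is to poll the proc until its status becomes "complete". The sketch below takes the retrieval function as an argument so that it is independent of any particular client; waitForProc, sleep, and the intervalMs and maxAttempts options are all hypothetical names and defaults chosen for illustration, not recommendations from PDFDATA.io.

```javascript
// Resolve after the given number of milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical helper: poll until the proc's status is "complete".
// getProc must accept a proc ID and return a promise of a proc object,
// as pdfdata.procs.get is documented to do.
async function waitForProc(getProc, procId, { intervalMs = 5000, maxAttempts = 60 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const proc = await getProc(procId);
    if (proc.status === "complete") return proc;
    await sleep(intervalMs); // still pending: wait before checking again
  }
  throw new Error("proc " + procId + " still pending after " + maxAttempts + " checks");
}
```

With pdfdata-node this might be called as waitForProc(function (id) { return pdfdata.procs.get(id); }, "proc_15555c7e6c2"). A bounded number of attempts keeps a stuck proc from blocking the caller indefinitely.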

Proc object attributes
type

The type of this object, always "proc".

id

Each proc is assigned a unique identifier when it is created.

created

The time when this proc was created.

source_tags

A collection of the string tags, if any, that were used to select this proc's source documents. If this proc's set of source documents was specified by tags, they will be recorded here.

operations

A collection of specifications of data extraction operations that the proc will apply or has applied to its source documents.

status

Either "pending" or "complete", indicating whether a proc's results are available or not. This attribute determines the presence of the docids and documents attributes.

docids optional

Present only when a proc's status is "pending", this will be a collection of the (string) IDs of the documents included in the proc.

documents optional

When a proc's status is "complete", this will be a collection of the document objects included in this proc. Each document object will also carry an additional results attribute, an array of extracted data corresponding to (and in the same order as) the operations specified when the proc was created.
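Because each document's results array is in the same order as the proc's operations, the two can be paired by index. The following is a minimal sketch over a proc object shaped like the examples in this documentation; pairResults is a hypothetical name.

```javascript
// Hypothetical helper: for a complete proc, pair each document's results
// with the operation spec that produced them, by position.
function pairResults(proc) {
  if (proc.status !== "complete") return []; // pending procs carry docids, not documents
  return proc.documents.map((doc) => ({
    id: doc.id,
    results: proc.operations.map((operation, i) => ({
      operation: operation,
      result: doc.results[i]
    }))
  }));
}
```

For a pending proc this sketch returns an empty array, reflecting that the documents attribute is only present once the proc is complete.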

All proc operations are mediated through the /v1/procs resource and its descendants.

Creating procs

Definition

POST https://api.pdfdata.io/v1/procs
pdfdata.procs.configure({operations: {ARRAY_OF_OPERATION_SPECS},
                         [file: {ARRAY_OF_PDF_PATHS},]
                         [tag: {ARRAY_OF_TAGS},]
                         [docid: {ARRAY_OF_DOCUMENT_IDS},]
                         [wait: {SECONDS}]})
    .start()

pdfdata.procs.configure()
    .operations({ARRAY_OF_OPERATION_SPECS})
    .withFiles({ARRAY_OF_PDF_PATHS})
    .withTags({ARRAY_OF_TAGS})
    .withDocuments({ARRAY_OF_DOCUMENT_IDS})
    .start()

Example request: creating a proc over new source documents being uploaded

curl https://api.pdfdata.io/v1/procs \
  -u test_xkH4xlrO7K80J5CTwEjBeSo6: \
  -F file=@{PATH_TO_PDF} \
  -F file=@{PATH_TO_PDF2} \
  -F operations='[{"op":"metadata"}]'

var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.procs.configure()
    .withFiles(["{PATH_TO_PDF}", "{PATH_TO_PDF2}"])
    .operations([{op:"metadata"}])
    .start()
    .then(function (response) {
        // handle response
    })
    .catch(function (error) {
        // handle failure
    });

Example request: creating a proc over documents identified by ID

curl https://api.pdfdata.io/v1/procs \
  -u test_xkH4xlrO7K80J5CTwEjBeSo6: \
  -d docid={DOCUMENT_ID} \
  -d docid={DOCUMENT_ID2} \
  -d operations='[{"op":"metadata"}]'

var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.procs.configure()
    .withDocuments(["{DOCUMENT_ID}", "{DOCUMENT_ID2}"])
    .operations([{op:"metadata"}])
    .start()
    .then(function (response) {
        // handle response
    })
    .catch(function (error) {
        // handle failure
    });

Example request: creating a proc over documents selected via tags

curl https://api.pdfdata.io/v1/procs \
  -u test_xkH4xlrO7K80J5CTwEjBeSo6: \
  -d tag={TAG} \
  -d tag={TAG2} \
  -d operations='[{"op":"metadata"}]'

var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.procs.configure()
    .withTags(["{TAG}", "{TAG2}"])
    .operations([{op:"metadata"}])
    .start()
    .then(function (response) {
        // handle response
    })
    .catch(function (error) {
        // handle failure
    });

Procs are created by sending a POST request or, with pdfdata-node, by configuring a proc object and then calling .start() on it. Either way, the request or configuration:

  1. Declares which source documents should be included in the proc. This can be done by one of:
    • enumerating the IDs of already-uploaded documents
    • providing tags that should be used to select many already-uploaded documents that were previously assigned those tags
    • uploading source PDF documents directly as part of the proc-creation request
  2. Provides specifications of which data extraction operations (plus optional configuration) should be applied to those documents.

With pdfdata-node, the configuration can be provided via an object literal to the .configure() method, built up using the fluent "builder"-style API (e.g. .withFiles(["path/to/document.pdf"]) instead of .configure({file: ["path/to/document.pdf"]})), or a combination of both.

Request parameters / Configuration options

One and only one of docid, file, or tag is required and allowed.

docid optional

HTTP API: when provided, must be an ID of an unexpired source document previously provided to PDFDATA.io. Multiple document IDs can be specified via multiple docid request parameters.

pdfdata-node: when provided, must be an array of IDs of unexpired source documents previously provided to PDFDATA.io. The .withDocuments() method can be used to add document IDs to a proc configuration.

If any provided document ID is unknown, or refers to an expired document, then the entire request will fail, and no proc will be created.

file optional

HTTP API: when provided, must be the binary content of a source PDF document. Multiple files can be uploaded via multiple file request parameters. When this parameter is used, the entire request must be encoded as multipart/form-data.

pdfdata-node: when provided, must be an array of string paths naming PDF documents. The .withFiles() method can be used to add files to a proc configuration.

The handling of file parameters and semantics around document objects when creating a proc are identical to those when uploading documents separately.

tag optional

HTTP API: when provided, must be a tag associated with unexpired source documents previously provided to PDFDATA.io. Multiple tags can be specified via multiple tag request parameters.

pdfdata-node: when provided, must be an array of string tags associated with unexpired source documents previously provided to PDFDATA.io. The .withTags() method can be used to add source document tags to a proc configuration.

The sets of documents selected by each tag are merged to form the new proc's working set (i.e. tags select documents disjunctively). If none of the provided tags are associated with unexpired source documents, then the request will fail, and no proc will be created.

operations

A JSON-encoded collection of descriptions of data extraction operations. Each of the operations will be applied to each of the proc's source documents. With pdfdata-node, the .operations() method can be used to add operations to a proc configuration.

wait optional

By default, PDFDATA.io will wait 30 seconds for a proc to complete before issuing a response to the proc-creation request. For very small workloads and simple data extraction operations, this means that many procs can be created and results returned with a single API call.

This parameter allows you to specify a shorter period, in seconds. wait parameter values larger than 30 will be ignored. Typical usage would be to provide a wait value of 0, which will cause the API to issue the pending proc response immediately; a later request will then be necessary to check on the proc's status and potentially retrieve results.

See "Procs are processes" for further background / overview of the asynchronous nature of procs.

Procs can be created with document IDs, or document tags, or by uploading new documents, but these options are exclusive (i.e. you cannot upload new documents, and specify tags to select additional previously-uploaded documents to be included in the proc).
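Since a request that mixes these options will be rejected, it can be convenient to check a configuration before sending it. The sketch below validates an object literal of the kind passed to .configure(), using the docid, file, and tag option names from the definition above; procSourceOption is a hypothetical name and this check is not part of pdfdata-node.

```javascript
const SOURCE_OPTIONS = ["docid", "file", "tag"];

// Hypothetical helper: return the single source option used by a proc
// configuration, or throw if zero or more than one is present.
function procSourceOption(config) {
  const provided = SOURCE_OPTIONS.filter(
    (name) => Array.isArray(config[name]) && config[name].length > 0
  );
  if (provided.length !== 1) {
    throw new Error(
      "expected exactly one of docid, file, or tag; got " + (provided.join(", ") || "none")
    );
  }
  return provided[0];
}

console.log(procSourceOption({ tag: ["january-resumes"], operations: [{ op: "metadata" }] })); // prints: tag
```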

When a document is included in a new proc, its expires attribute is "reset" to 30 days from the time of the proc's creation, mirroring the expiration extension that occurs when a document is re-uploaded to PDFDATA.io.

PDFDATA.io provides many, many data extraction operations. The particulars of what configuration options are supported by each operation and the shape of the structured data and flavour of binary resources they extract are all described here.

The proc creation response

Successfully creating a proc will produce an API response containing the new proc object. That response will:

  • include a Location header, the value of which will be the canonical URL for the new proc. That URL can be retrieved later in order to check the proc's status, or retrieve its data extraction results.
  • bear an HTTP 201 (created) status if the proc completes before the response is sent. Otherwise, a status of 202 (accepted) will be sent back, indicating the proc's pending status at the time of the response.

The body of the proc-creation response will be the JSON-encoded proc object itself, equivalent to separately requesting the proc object later via its canonical URL, described next.
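As a rough sketch of how a client might act on the two statuses and the Location header described above, the helper below classifies a proc-creation response. The function name `interpretProcCreation` is hypothetical, and the sketch assumes your HTTP client exposes header names in lower case (as Node's built-in http module does).

```javascript
// Sketch: classify a proc-creation response by its HTTP status.
// `status` and `headers` are plain values taken from whichever HTTP client you use.
function interpretProcCreation(status, headers) {
  if (status !== 201 && status !== 202) {
    throw new Error("Unexpected proc-creation status: " + status);
  }
  return {
    complete: status === 201,  // 201: the proc finished, results are in the body
    pending: status === 202,   // 202: the proc is still running, check back later
    procUrl: headers.location  // canonical URL for follow-up GET requests
  };
}

const outcome = interpretProcCreation(202, {
  location: "https://api.pdfdata.io/v1/procs/proc_15555c7e6c2"
});
console.log(outcome.pending); // true
```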

Getting the results of a proc

Definition

GET https://api.pdfdata.io/v1/procs/{PROC_ID}
pdfdata.procs.get({PROC_ID});

Example request

curl https://api.pdfdata.io/v1/procs/proc_15555c7e6c2 \
  -u test_xkH4xlrO7K80J5CTwEjBeSo6:
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.procs.get("proc_15555c7e6c2")
    .then(function (response) {
        // handle response
    })
    .catch(function (error) {
        // handle failure
    });

Example (pending) response

{
  "type": "proc",
  "id": "proc_1555580e8ff",
  "created": "2016-06-15T19:19:19Z",
  "source_tags": [],
  "operations": [{
    "op": "metadata"
  }],
  "status": "pending",
  "docids": [
    "doc_8e9600cd7db5baf2fad83e4d8b48359678b24322",
    "doc_8e96ec0533ac3e1e988b7d1ca27bfdc096b82ddc",
    "doc_a5d8e5d0b99ac891226acb35f24a9f8f8eda50df"
  ]
}

Example (completed) response

{
  "type": "proc",
  "id": "proc_1555579d8df",
  "created": "2016-06-15T19:11:36Z",
  "source_tags": [],
  "operations": [{
    "op": "metadata"
  }],
  "status": "complete",
  "documents": [{
    "type": "doc",
    "id": "doc_8e96ec0533ac3e1e988b7d1ca27bfdc096b82ddc",
    "filename": "8e96ec0533ac3e1e988b7d1ca27bfdc096b82ddc.pdf",
    "tags": ["acquired:2016-06-15"],
    "created": "2016-06-15T19:11:36Z",
    "expires": "2016-07-15T19:15:32Z",
    "results": [{
      "op": "metadata",
      "data": {
        "Creator": "SPDF",
        "Title": "Microarray Gene Expression Data with Linked Survival Phenotypes",
        "Producer": "AppendPro 3.0 Linux 7 SPDF_1085 May 15 2003",
        "Subject": "Center for Bioinformatics & Molecular Biostatistics",
        "ModDate": "2006-01-24T20:38:13Z",
        "Keywords": "Diffuse large-B-cell lymphoma; Gene harvesting; Least angle regression",
        "CreationDate": "2006-01-24T20:38:13Z",
        "Author": "Mark R. Segal",
        "SPDF": 1085,
        "Changes":
        [{"CreationDate": "2006-01-24T20:38:13Z",
          "Producer":     "SPDF",
          "ModDate":      "2006-01-24T20:38:13Z",
          "Creator":      "SPDF"}]
      }
    }]
  }]
}

Proc objects always exist in one of two primary states, indicated by the status attribute: "pending" or "complete".

If a new proc is completed before the proc-creation response is issued (the upper bound of which is controlled by the wait parameter), then the response to that request will include the resulting extracted data. Otherwise, you will need to make one or more additional API calls to check the status of the proc and receive its results when it is complete.

When using the HTTP API directly, this is done by issuing a GET request for the proc's canonical URL, which is included in the Location header of every proc-creation response. (You can also reliably construct this URL for any known proc ID.) The response will always be the single identified proc object.

When using pdfdata-node, this is done by calling the pdfdata.procs.get() method with a proc ID string (or a pending proc object), or by calling pdfdata.procs.getCompleted() to obtain a promise that will automatically be fulfilled when the proc is complete, described next.
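For clients that do not use pdfdata-node, the canonical URL can be built from a known proc ID by following the pattern in the Definition above. The sketch below is one way to do that; `procUrl` is a hypothetical helper, and the check that IDs begin with "proc_" is an assumption drawn from the example IDs in this documentation.

```javascript
// Sketch: build a proc's canonical URL from its ID.
const API_ROOT = "https://api.pdfdata.io/v1";

function procUrl(procId) {
  // Assumption: proc IDs always start with "proc_", as in the examples here.
  if (typeof procId !== "string" || !procId.startsWith("proc_")) {
    throw new Error("Not a proc ID: " + procId);
  }
  return API_ROOT + "/procs/" + encodeURIComponent(procId);
}

console.log(procUrl("proc_15555c7e6c2"));
// https://api.pdfdata.io/v1/procs/proc_15555c7e6c2
```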

Waiting for proc completion

Waiting 5 minutes for a proc to complete

var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.procs.getCompleted("proc_15555c7e6c2", 5 * 60 * 1000)
    .then(function (proc) {
        if (proc.status === "complete") {
            // handle completed proc
        } else {
            // proc is still "pending"
        }
    })
    .catch(function (error) {
        // handle failure
    });

"Manually" polling pdfdata.procs.get() periodically to check whether a proc has completed its work is not a pleasant programming task. Therefore, pdfdata-node provides a simpler way to be notified when a proc is completed.

The example to the right demonstrates the use of pdfdata.procs.getCompleted(), which returns a promise that is fulfilled when the identified proc is completed, or when the specified timeout expires.

A call to .getCompleted() can be easily applied to the result of creating a proc, wrapping up that creation and any subsequent waiting for the proc to finish into a single block of promise calls and handling.

pdfdata.procs.getCompleted() arguments  
procid The string ID of the proc to retrieve (or, a full proc object, e.g. a pending proc produced by a proc creation call)
timeout_ms How long to wait (in milliseconds) for the identified proc to complete. If this period expires, the returned promise will be fulfilled with a still-pending proc object.
polling_interval optional How often to poll the PDFDATA.io API to check on the status of the proc. 1000 is both the default and the minimum value.
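To make the polling behaviour concrete, here is a minimal sketch of a poll-until-complete loop. It is not the pdfdata-node implementation; `pollUntilComplete` and `sleep` are hypothetical names, and the `getProc` argument stands in for any function that returns a promise of a proc object (for example, a wrapper around a GET request for the proc's URL).

```javascript
// Sketch: poll for a proc until it completes or a timeout expires.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function pollUntilComplete(getProc, procId, timeoutMs, intervalMs = 1000) {
  const deadline = Date.now() + timeoutMs;
  let proc = await getProc(procId);
  // Keep polling while the proc is unfinished and another interval still fits.
  while (proc.status !== "complete" && Date.now() + intervalMs <= deadline) {
    await sleep(intervalMs);
    proc = await getProc(procId);
  }
  return proc; // may still be "pending" if the timeout expired first
}
```

As with .getCompleted(), the caller still needs to check the status of the returned proc, because the promise resolves in both the completed and the timed-out case.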

The structure of data extraction results

PDF documents contain a wide range of data types: key/value metadata, on-page character data, bitmap images, form data, vector graphics, and more. PDFDATA.io offers an even wider range of operations that extract this data, some of which reorganize and integrate those fundamental data types in different ways (for example, to selectively extract text based on your own criteria, infer the location and structure of tabular data, rasterize vector graphics, and so on). The structure of a proc's data extraction results will depend entirely on the operations you select and configure when creating it.

The results attribute added to each document object in a completed proc response is always a collection of operation result objects, in the same order they were specified when the proc was created. Operations can produce two kinds of results, data and resources; depending on what an operation does, it may produce either or both.

Data may refer to resources within the same operation's results, if the operation produces both. For example, a sub-attribute of the data value may describe the position and extent of an image, referring to the image's corresponding resource, which can be retrieved separately.

Please refer to the documentation for each operation for particulars on what data and/or resources they produce.
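As an illustration of the structure described above, the following sketch flattens a completed proc into one row per document and operation. `collectResults` is a hypothetical helper; it relies only on the documents, results, op, data, and resources attributes shown in the example responses.

```javascript
// Sketch: flatten a proc's results into one row per (document, operation).
function collectResults(proc) {
  const rows = [];
  // Pending procs carry no `documents` attribute, so default to an empty list.
  for (const doc of proc.documents || []) {
    for (const result of doc.results || []) {
      rows.push({
        docId: doc.id,
        op: result.op,
        data: "data" in result ? result.data : null,
        resourceIds: Object.keys(result.resources || {})
      });
    }
  }
  return rows;
}
```

Because results appear in the same order as the operations given at proc creation, `rows` for a single document can also be matched back to the proc's operations array by position.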

Proc and operation failures

Example of failed operation result object

{
  "type": "proc",
  "id": "proc_15651e77af7",
  "created": "2016-08-03T19:35:40Z",
  "source_tags": [],
  "operations": [{
    "op": "metadata"
  }],
  "documents": [{
    "type": "doc",
    "id": "doc_144a4c8c770dba924e924c8ee26099d585c83986",
    "filename": "144a4c8c770dba924e924c8ee26099d585c83986.pdf",
    "tags": ["acquired:2016-08-03"],
    "created": "2016-08-03T19:35:32Z",
    "expires": "2016-09-02T19:35:40Z",
    "results": [{
      "op": "metadata",
      "data": {
        "Title": "Content Categorization Methodologies: The Good, The Bad, The Best, and Why",
        "Author": "Claude Vogel",
        "Creator": "Microsoft Word 9.0",
        "ModDate": "2002-11-08T14:27:29Z",
        "Producer": "Acrobat Distiller 4.05 for Windows",
        "CreationDate": "2002-10-15T12:49:49Z"
      }
    }]
  }, {
    "type": "doc",
    "id": "doc_9662c6d4f7a7eedeb1304688c5767cfa84db067a",
    "filename": "report.png",
    "tags": ["acquired:2016-08-03"],
    "created": "2016-08-03T19:32:44Z",
    "expires": "2016-09-02T19:35:40Z",
    "results": [{
      "op": "metadata",
      "failure": true
    }]
  }],
  "status": "complete"
}

As discussed elsewhere, the PDFDATA.io API reports some errors unambiguously, e.g. by returning an HTTP response with a 4xx status code if a request is malformed in some way. However, procs do their work asynchronously, and each proc may apply more than one operation to more than one source document. Standard HTTP response statuses and client-side error handling therefore aren't sufficient to capture the general case, e.g. a single operation failing to extract the requested data from a single source document while succeeding with hundreds or thousands of other source documents that are part of the same proc.

In such cases where an operation fails to process a source document, the result object it produces for that one document will not contain any extracted data. Rather, it will contain only the name of the failing operation, and a failure attribute.

In the example to the right, a proc was created applying the metadata operation to two sources. The first succeeded and provided the expected data; the second source was not a PDF document (it was an image), so the result object produced by the metadata operation indicates the processing failure.
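Since a completed proc can mix successful and failed results, client code usually needs to separate the two. The sketch below shows one way to do so, assuming only the failure attribute described above; `partitionResults` is a hypothetical name.

```javascript
// Sketch: split a completed proc's results into successes and failures.
function partitionResults(proc) {
  const succeeded = [];
  const failed = [];
  for (const doc of proc.documents || []) {
    for (const result of doc.results || []) {
      const entry = { docId: doc.id, filename: doc.filename, op: result.op };
      if (result.failure === true) {
        failed.push(entry); // failed results carry no extracted data
      } else {
        succeeded.push(entry);
      }
    }
  }
  return { succeeded, failed };
}
```

Applied to the example response, this would report the metadata operation as failed for report.png and as succeeded for the other document.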

Operations

PDFDATA.io offers a number of different data extraction operations. Each section that follows details one operation, providing:

  1. An overview of the operation and the PDF data it extracts.
  2. A "specification" and example of the configuration object that is provided when creating procs, when you wish to apply an operation to a set of source documents.
  3. Where possible, a specification of the shape of the data extraction results the operation produces, and always an example or two of those results.

Text

Example text extracted using "decompose" layout

 Relaciones comerciales desde :
 Crédito :
 Consumo mensual :
 Máximo consumo :
 Plazo de pago :
 Pago con cheque :
 Pago con transferencia :
 Pago en efectivo :
 Productos que compran :
 Cifras expresadas en :
 Opinión :
 Atraso en pagos :
 Requieren garantías :

 19/08/2014
  2008
 Abierto
 300,000.00
 500,000.00
 90 días
 NO
   SI
 NO
 Equipo hotelero y de restaurante
  Pesos
 Muy Bueno
 No tiene
 NO

Example text extracted using "preserve" layout

                                                     19/08/2014
 Relaciones comerciales desde :                       2008
 Crédito :                                           Abierto
 Consumo mensual :                                   300,000.00
 Máximo consumo :                                    500,000.00
 Plazo de pago :                                     90 días
 Pago con cheque :                                   NO
 Pago con transferencia :                              SI
 Pago en efectivo :                                  NO
 Productos que compran :                             Equipo hotelero y de restaurante
 Cifras expresadas en :                               Pesos
 Opinión :                                           Muy Bueno
 Atraso en pagos :                                   No tiene
 Requieren garantías :                               NO

PDFDATA.io's text operation extracts the text contained in source PDF documents. PDFDATA.io supports all languages, text encodings, character sets, and writing modes, except for right-to-left writing systems like those associated with Arabic, Hebrew, Urdu, and so on.

In order to accommodate different types of documents and expectations of the sorts of content they contain, the text operation provides two different layout options, "preserve" and "decompose".

The difference between "preserve" and "decompose" is best illustrated with an example. Consider the following portion of a PDF document:

The text extracted by PDFDATA.io using the two different layout options is shown to the right. The "decompose" layout mode attempts to produce a "linearization" of all of the text on each page that matches natural human reading order; for example, text laid out in columns will be separated so that each column's content will follow in sequence. This makes "decompose" ideal for use with documents that contain narrative content that you might later subject to indexing, natural language processing analyses, semantic entity labelling, summarization, and so on.

Example tabular text extracted using "preserve" layout

                                        Original              Beginning                                                                             Interest                             Ending
                                       Certificate             Certificate           Principal              Interest      Realized Loss            Shortfall          Total            Certificate
  Class         Cusip                  Face Value             Balance (1)          Distribution         Distribution (2)       of Principal         Amount          Distribution        Balance (1)
  A-1         04541GGN6                   $230,000,000.00              $0.00                   $0.00               $0.00                  N/A             $0.00            $0.00                  $0.00
  A-2         04541GGP1                   $268,000,000.00             ($0.00)                  $0.00               $0.00                  N/A             $0.00            $0.00                  $0.00
  A-3         04541GGQ9                   $128,200,000.00              $0.00                   $0.00               $0.00                  N/A             $0.00            $0.00                  $0.00
  A-IO        04541GGR7                    $60,200,000.00              $0.00                   $0.00               $0.00                  N/A             $0.00            $0.00                  $0.00
  M-1         04541GGS5                    $45,000,000.00      $42,467,274.25                  $0.00          $243,574.81               $0.00             $0.00       $243,574.81       $42,467,274.25
  M-2         04541GGT3                    $37,500,000.00      $37,500,000.00                  $0.00          $228,743.68               $0.00             $0.00       $228,743.68       $37,500,000.00
  M-3         04541GGU0                    $11,250,000.00      $11,250,000.00                  $0.00           $68,623.11               $0.00             $0.00        $68,623.11       $11,250,000.00
  M-4         04541GGV8                    $11,250,000.00       $9,031,924.44            $890,676.02           $55,093.22               $0.00             $0.00       $945,769.24        $8,141,248.42
  M-5        04541GGW6                      $9,370,000.00       $2,766,258.24             $24,577.16            $4,121.36               $0.00         $12,752.35       $28,698.52        $2,741,681.08
  M-6         04541GGX4                     $9,372,000.00       $2,864,767.61             $78,620.39               $0.00                $0.00         $17,474.60       $78,620.39        $2,786,147.22
   P          04541GHA3                          $100.00               $0.00                   $0.00               $0.00                $0.00             $0.00            $0.00                  $0.00
   X          04541GGZ9                            $0.00        $4,770,105.14                  $0.00               $0.00                $0.00             $0.00            $0.00         $4,780,892.59
   R          04541GHB1                            $0.00               $0.00                   $0.00               $0.00                  N/A             $0.00            $0.00                  $0.00
  B-IO        04541GGY2                    $54,000,000.00              $0.00                   $0.00               $0.00                  N/A             $0.00            $0.00                  $0.00
  Total                                   $749,942,100.00     $105,880,224.54            $993,873.57          $600,156.18               $0.00         $30,226.95    $1,594,029.75      $104,886,350.97

In contrast, the (default) "preserve" layout results in extracted text that roughly matches the spatial arrangement of that text on each page. This makes "preserve" ideal for cases where post-processing of PDFDATA.io-extracted text will be used to identify structured data elements, like the label/data pairs in the example above, or regions like this financial disclosure table:

Applying the text operation with a layout of "preserve" to this source PDF document yields the well-formatted text shown to the right.

Example operation configuration

{
  "op": "text",
  "layout": "preserve"
}

text operation configuration object attributes
op

Must be "text".

layout optional

Either "preserve" or "decompose"; defaults to "preserve".

text operation results consist of a data array containing one page object for each page in the source PDF document. Each page object has a string text attribute, the text extracted from that page, according to the layout configuration option specified when the proc was created.

Example result object

{
  "op": "text",
  "data": [{
    "text": "                                                     19/08/2014\n Relaciones comerciales desde :                       2008\n Crédito :                                           Abierto\n Consumo mensual :                                   300,000.00\n Máximo consumo :                                    500,000.00\n Plazo de pago :                                     90 días\n Pago con cheque :                                   NO\n Pago con transferencia :                              SI\n Pago en efectivo :                                  NO\n Productos que compran :                             Equipo hotelero y de restaurante\n Cifras expresadas en :                               Pesos\n Opinión :                                           Muy Bueno\n Atraso en pagos :                                   No tiene\n Requieren garantías :                               NO",
    "type": "page",
    "pagenum": 0,
    "dimensions": [595, 842]
  }]
}

text Page object attributes
type

The type of this object, always "page".

pagenum

The source document's page number, zero-indexed.

dimensions

The [width, height] of this object.

text

The text extracted from page pagenum, using the layout indicated in the operation's configuration.
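As an example of the kind of post-processing that the "preserve" layout is suited to, the sketch below pulls label/value pairs out of a page's text. It assumes lines shaped like the example above (a label, a colon, then at least two spaces before the value); `parseLabelledLines` is a hypothetical helper, and real documents will likely need a pattern tuned to their own layout.

```javascript
// Sketch: extract "Label :    value" pairs from text produced by the "preserve" layout.
function parseLabelledLines(pageText) {
  const pairs = {};
  for (const line of pageText.split("\n")) {
    // Label, optional spaces, a colon, two or more spaces, then the value.
    const match = /^\s*(.+?)\s*:\s{2,}(\S.*?)\s*$/.exec(line);
    if (match) {
      pairs[match[1]] = match[2];
    }
  }
  return pairs;
}

const sample = [
  "                                  19/08/2014",
  " Crédito :                        Abierto",
  " Plazo de pago :                  90 días"
].join("\n");
console.log(parseLabelledLines(sample)["Plazo de pago"]); // 90 días
```

Lines without a label, such as the date on the first line, are skipped. For a result object with several pages, the same function can be applied to each page object's text attribute in turn.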

Interactive form data

Filling an interactive PDF form in Acrobat

In addition to conveying rendered text and graphics, PDF documents can be used as interactive forms for gathering data. To do this, the PDF specification allows a document to contain any number of form fields, similar to the text, radio button, checkbox, and select form fields found in HTML. In contrast with HTML, the data entered into an interactive PDF form's fields isn't sent to a remote server; rather, that data is saved as part of a new version of the original PDF document. PDFDATA.io's interactive-form operation extracts this data.

Example operation configuration

{
  "op": "interactive-form"
}

interactive-form operation configuration object attributes
op

Must be "interactive-form".

The interactive-form operation will always extract the data for all of the fields in a source document. (A single PDF document only contains one "form", so we're using the terms "document" and "form" interchangeably in this section of the API documentation.) Interactive PDF forms can contain any number of the following sorts of fields:

Types of interactive PDF form fields  
"text" A simple editable text field.
"checkbox" A button with an associated on/off state.
"radiogroup" A set of buttons where at most one can have an "on" state.
"choice" A set of options — rendered as a drop-down menu or scrolling list — where one or more may be selected.

Example interactive-form result object

{
  "op": "interactive-form",
  "data": [
    {
      "name": "f1-1",
      "value": "John Doe",
      "bounds": [86.6656, 699.602, 590.658, 714.268],
      "pagenum": 0,
      "fieldtype": "text"
    },
    {
      "name": "f1-5",
      "value": "New York, NY 10001",
      "bounds": [86.33229, 603.26984, 410.66068, 617.9363],
      "pagenum": 0,
      "fieldtype": "text"
    },
    {
      "name": "f1-4",
      "value": "123 Canal St.",
      "bounds": [86.66562, 627.2695, 409.994, 641.9359],
      "pagenum": 0,
      "fieldtype": "text"
    },
      
    ....

    {
      "name": "c1-2",
      "value": "/Yes",
      "bounds": [251.663, 659.602, 259.996, 667.269],
      "checked": false,
      "pagenum": 0,
      "fieldtype": "checkbox"
    },
    {
      "name": "c1-1",
      "value": "/Yes",
      "bounds": [171.664, 658.602, 181.331, 668.269],
      "checked": true,
      "pagenum": 0,
      "fieldtype": "checkbox"
    }
  ]
}

Each form field found in a document is represented as a field object in the data array of the interactive-form operation's result object. If a source document does not contain any fields, then the result object's data attribute will be an empty array.

All form fields, regardless of type, may carry a number of common attributes:

Common form field object attributes
fieldtype

The type of this field, one of "checkbox", "radiogroup", "text", or "choice".

name

The form field's name, unique within its source document/form. This name is set by the generator of the source PDF document, and generally will have no identifiable meaning.

mapping_name optional

The form field's "mapping" name, used to identify the field in exported form data formats. If this was set, it might correspond with a schema element name, database column, etc.

ui_name optional

The form field's human-friendly name that might be used for display to users, accessibility tools, and other interface contexts.

default optional

The field's default value, a string.

pagenum

The source document's page number, zero-indexed.

bounds

A description of a rectangular bounding box's coordinates.

In addition to these base attributes, some additional attributes will be present depending on the type of field:

Additional text form field object attributes
value

A string, the "plain text" value of this text field; may be null.

xhtml_value optional

The value of this text field, represented as a string containing XHTML content. Note: it is rare for interactive PDF forms to offer these sorts of "rich-text" fields.

Additional checkbox form field object attributes
checked

true or false, indicates whether the checkbox field is checked/activated or not.

value

The "current value" of this checkbox field. Checkboxes in interactive PDF forms (should) always have an associated value, even when they are unchecked. When unchecked, this value should be "/Off"; when checked, this value can be anything (it is determined when the form is created / generated), and might have some connection to the semantics of the checkbox within the form. A common checked value is "/Yes", but other observed examples include "/Exempt", "/Pass", or "/USD".

We generally recommend that your applications identify the semantics of each checkbox form field independently and rely upon the extracted form field's checked flag, since checkbox values are not always a reliable indicator.

Additional radiogroup form field object attributes
options

An array of strings enumerating the possible values this radiogroup field may have.

value

This radiogroup's value, a string selected from the options array. If, when the form was authored, the radiogroup was configured to allow no selection, then value may be null.

Additional choice form field object attributes
values

An array of strings, the options selected in this choice field. Possible options are enumerated in the "keys" of the options object. Note that it is possible for choice fields to accept a user-entered value that does not appear in options (a choice made by the author of the form).

options

An object enumerating the possible values this choice field can carry, with string "keys" (corresponding to possible values) mapping to human-readable descriptions of those possible values, e.g. `{"V": "Visa", "MC": "Mastercard", "AMEX": "American Express"}`. Note that many PDF forms omit the separate human-readable descriptions, so it's possible to encounter options objects like e.g. `{"V": "V", "MC": "MC", "AMEX": "AMEX"}`.

Consuming PDF form data

There are two main strategies for consuming PDF form data:

  1. If you know what the form fields' names are ahead of time (maybe because you have control over the form's generation and can control them, or perhaps because you have already determined the correspondence between each field name and your application's data model), then you can simply loop through each form field, consume its state (value, values, and checked attributes, depending on the field type), and apply it along with the field's name (or mapping_name, as appropriate) to your data model.
  2. If the forms you are consuming have meaningless name attributes (e.g. "c1-1"), then you will have to identify each form field with regard to your application's data model. If the forms you're consuming use a stable set of field names, you can then proceed according to strategy (1). If field names change from form to form (a possibility if e.g. you are consuming IRS 1040 forms generated by many different PDF producers), then you would be better off using the form field location information that the interactive-form operation provides (pagenum and bounds) to identify each form field extracted from each document.
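The two strategies above might be sketched as follows. The helper names (`fieldState`, `formByName`, `findFieldAt`) are hypothetical, and the lookup by location assumes that bounds is ordered as [x1, y1, x2, y2], which is consistent with the example result objects but is not stated explicitly in this documentation.

```javascript
// Sketch: read a field's state according to its type.
function fieldState(field) {
  switch (field.fieldtype) {
    case "checkbox": return field.checked; // prefer the flag over the raw value
    case "choice": return field.values;    // an array of selected options
    default: return field.value;           // "text" and "radiogroup" fields
  }
}

// Strategy 1: key each field's state by its name
// (mapping_name could be substituted where a form sets it).
function formByName(result) {
  const state = {};
  for (const field of result.data) {
    state[field.name] = fieldState(field);
  }
  return state;
}

// Strategy 2: find the field whose bounds contain a known point on a page.
function findFieldAt(result, pagenum, x, y) {
  const hit = result.data.find((field) => {
    const [x1, y1, x2, y2] = field.bounds; // assumed ordering, see above
    return field.pagenum === pagenum && x >= x1 && x <= x2 && y >= y1 && y <= y2;
  });
  return hit || null;
}
```

With the location strategy, an application would keep its own table of points (one per field of interest, per page) and resolve each to a field with `findFieldAt`, regardless of what the field happens to be named in a given document.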

Bitmap images

Example operation configuration

{
  "op": "images"
}

Many PDF documents include graphics by way of embedding images of the bitmap/raster variety typically encoded and exchanged via JPEG, PNG, TIFF, and similar formats. The images operation extracts these bitmaps and their rendered location from source PDF documents.

images operation configuration object attributes
op

Must be "images".

Example images result object

{
  "op": "images",
  "data": [{
    "type": "page",
    "dimensions": [595, 839],
    "pagenum": 0,
    "images": [{
      "type": "img",
      "bounds": [62.362, 248.541, 72.362, 255.541],
      "resource": "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a"
    }, {
      "type": "img",
      "bounds": [62.362, 173.397, 63.362, 174.397],
      "resource": "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72"
    }]
  }],
  "resources": {
    "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a": {
      "url": "/v1/resources/rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a",
      "format": "png",
      "mimetype": "image/png",
      "dimensions": [10.0, 7.0]
    },
    "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72": {
      "url": "/v1/resources/rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72",
      "format": "png",
      "mimetype": "image/png",
      "dimensions": [1.0, 1.0]
    }
  }
}

The images operation produces result objects that enumerate each page of the source document where images were found, which contain an array of objects (one for each image) that provide exact location and size information for the image and refer to its corresponding resource.

Specifically, the data attribute is a listing of each page of the source document which contains images. Those page objects enumerate each rendered image in their images array attribute, and each image object describes the image's position and extent via bounds, referring to the corresponding resources that hold the actual bitmap data via a resource attribute.

images Page object attributes
type

The type of this object, always "page".

pagenum

The source document's page number, zero-indexed.

dimensions

The [width, height] of this object.

images

An array of img objects, for each image found on the page.

img object attributes
type

The type of this object, always "img".

bounds

A description of a rectangular bounding box's coordinates. The dimensions implied by these bounds may be different than the physical dimensions of the referenced image resource.

resource

The ID of the resource associated with this object. Further data about the resource — including a url by which the resource's binary data can be retrieved — can be found within the same result object's resources object, keyed by this ID, e.g. result.resources["rsrc_5b9477c374e62a348f4423956684b028fe9acb95"].

In addition to the standard mimetype and url attributes carried by all resource objects, resources produced by the images operation include format and dimensions attributes that describe the resource's bitmap data itself. For example, a PNG image with dimensions of [100, 100] might be rendered on different pages of a source document scaled to different sizes, such as [50, 50] or [10, 100].

images operation's additional resource object attributes
format

The file extension associated with the image's data. The images operation currently produces image data encoded in either JPEG or PNG formats.

dimensions

The [width, height] of this object. This is the actual "physical" dimensions of the bitmap image, which may be different than the dimensions of the bounds where the image is rendered on pages in the source PDF.

Please refer to this guide's section on resources for a treatment of how the resources attribute in operation results is organized, and how you can retrieve resources' data via the PDFDATA.io API.
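To tie the pieces of an images result together, the sketch below joins each img object to its resource and compares the rendered size with the bitmap's own dimensions. `describeImages` is a hypothetical helper. It assumes bounds is ordered as [x1, y1, x2, y2] and that the relative url values resolve against https://api.pdfdata.io, the host used in the request examples.

```javascript
// Sketch: join each image in an images result to its resource.
const API_ORIGIN = "https://api.pdfdata.io"; // assumed base for relative resource URLs

function describeImages(result) {
  const described = [];
  for (const page of result.data) {
    for (const img of page.images) {
      const [x1, y1, x2, y2] = img.bounds; // assumed ordering
      const resource = result.resources[img.resource];
      described.push({
        pagenum: page.pagenum,
        renderedSize: [x2 - x1, y2 - y1],  // size on the page
        bitmapSize: resource.dimensions,   // size of the image data itself
        format: resource.format,
        url: new URL(resource.url, API_ORIGIN).href
      });
    }
  }
  return described;
}
```

Comparing `renderedSize` with `bitmapSize` shows how much an image was scaled when it was placed on the page.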

Document metadata

Most PDF documents contain a set of simple key/value metadata that often includes baseline information like a document's title, author, creation date, keywords, and what program or system generated the document. The metadata operation extracts this data.

Example operation configuration

{
  "op": "metadata"
}

Example metadata result object

{
  "op": "metadata",
  "data": {
    "Creator": "SPDF",
    "Title": "Microarray Gene Expression Data with Linked Survival Phenotypes",
    "Producer": "AppendPro 3.0 Linux 7 SPDF_1085 May 15 2003",
    "Subject": "Center for Bioinformatics & Molecular Biostatistics",
    "ModDate": "2006-01-24T20:38:13Z",
    "Keywords": "Diffuse large-B-cell lymphoma; Gene harvesting; Least angle regression",
    "CreationDate": "2006-01-24T20:38:13Z",
    "Author": "Mark R. Segal",
    "SPDF": 1085,
    "Changes":
    [{"CreationDate": "2006-01-24T20:38:13Z",
      "Producer":     "SPDF",
      "ModDate":      "2006-01-24T20:38:13Z",
      "Creator":      "SPDF"}]
  }
}

metadata operation configuration object attributes
op

Must be "metadata".

PDF document generators can opt to include any other metadata they choose in PDFs they produce, beyond the common attributes enumerated below. Though there are no universal conventions for the keys used to convey such additional metadata (whereas there is a universal convention defining the keys for the common baseline metadata), there are domain-specific metadata key conventions (e.g. within prepress, legal publishing, and other fields that often enrich document metadata beyond the baseline keyset).

Common document metadata attributes  
Title The document's title.
Author The name of the person who created the document.
Subject The document's subject.
Keywords Keywords associated with the document, usually comma- or semicolon-delimited.
Creator If the document was converted to PDF from another format, the name of the application that created the original document from which it was converted.
Producer If the document was converted to PDF from another format, the name of the application that converted it to PDF.
CreationDate A timestamp indicating when the document was created.
ModDate A timestamp indicating when the document was most recently modified.

In addition to simple atomic values, document metadata attributes can contain nested collections (arrays and/or objects), such as the Changes attribute in the example metadata result to the right. Such attributes are not common — they are keyed outside of the baseline keyset documented above — but applications should be written so as to accommodate their possibility.
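One way to accommodate both atomic and nested metadata values is sketched below, together with a small helper for the comma- or semicolon-delimited Keywords attribute. `splitKeywords` and `separateMetadata` are hypothetical names. Note that the keyword splitter breaks on both delimiters, which would be wrong for a semicolon-delimited list whose entries contain commas.

```javascript
// Sketch: split a Keywords string on commas or semicolons.
function splitKeywords(keywords) {
  if (typeof keywords !== "string") return [];
  return keywords
    .split(/[;,]/)
    .map((keyword) => keyword.trim())
    .filter((keyword) => keyword.length > 0);
}

// Sketch: separate atomic metadata values from nested arrays/objects.
function separateMetadata(data) {
  const atomic = {};
  const nested = {};
  for (const [key, value] of Object.entries(data)) {
    const isNested = value !== null && typeof value === "object";
    (isNested ? nested : atomic)[key] = value;
  }
  return { atomic, nested };
}

console.log(splitKeywords("Gene harvesting; Least angle regression").length); // 2
```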

XMP (XML) document metadata

Example operation configuration

{
  "op": "xmp-metadata"
}

Example result object

{
  "op": "xmp-metadata",
  "resources": {
    "rsrc_4218185ed0b3736f47ec787bb4142189cbcb057b": {
      "url": "/v1/resources/rsrc_4218185ed0b3736f47ec787bb4142189cbcb057b",
      "mimetype": "application/xml"
    }
  }
}

Example result when source PDF contains no XMP metadata

{
  "op": "xmp-metadata",
  "resources": {}
}

Example XMP metadata resource

This is a small example of the content of an extracted XMP metadata resource. Refer to the expansive XMP metadata specifications for a proper reference.

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/">
         <xmp:ModifyDate>2009-11-11T16:27:58+09:00</xmp:ModifyDate>
         <xmp:CreateDate>2009-11-11T16:12+09:00</xmp:CreateDate>
         <xmp:MetadataDate>2009-11-11T16:27:58+09:00</xmp:MetadataDate>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">C:\user\workspace\test.txt</rdf:li>
            </rdf:Alt>
         </dc:title>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
         <xmpMM:DocumentID>uuid:b5582405-9d15-42f0-9eeb-80480c76463b</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:6affcf96-d90e-4900-82f4-7c90463c50cd</xmpMM:InstanceID>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Producer>Acrobat Web Capture 9.0</pdf:Producer>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

Some PDFs contain XMP metadata, an alternative representation of document metadata that is encoded as XML. The xmp-metadata operation extracts this XML and provides it as a resource.

If no XMP metadata is found in the source document, then the resources attribute of the operation result will be an empty object.

xmp-metadata results carry no data attribute.
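Because the absence of XMP metadata is signalled by an empty resources object rather than an error, callers will usually want to check for it explicitly. A minimal sketch follows, assuming a result yields at most one XMP resource (the examples show one, but this documentation does not state a limit); xmpResourceURL is a hypothetical helper name.

```javascript
// Hypothetical helper: return the URL of the extracted XMP resource,
// or null when the source PDF contained no XMP metadata.
// Assumption: only the first resource is of interest.
function xmpResourceURL(result) {
  var ids = Object.keys(result.resources || {});
  return ids.length > 0 ? result.resources[ids[0]].url : null;
}

// Result objects copied from the examples in this section
var withXmp = {
  op: "xmp-metadata",
  resources: {
    rsrc_4218185ed0b3736f47ec787bb4142189cbcb057b: {
      url: "/v1/resources/rsrc_4218185ed0b3736f47ec787bb4142189cbcb057b",
      mimetype: "application/xml"
    }
  }
};
var withoutXmp = { op: "xmp-metadata", resources: {} };

console.log(xmpResourceURL(withXmp));
console.log(xmpResourceURL(withoutXmp));
```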

xmp-metadata operation configuration object attributes
op

Must be "xmp-metadata".

Document attachments

Example operation configuration

{
  "op": "attachments"
}

Example result object

{
  "op": "attachments",
  "data": [{
    "title": "Attachment1",
    "bounds": [100.0, 722.0, 120.0, 742.0],
    "pagenum": 0,
    "location": "..//..//two_pilots.bmp",
    "resource": "rsrc_5b9477c374e62a348f4423956684b028fe9acb95"
  }, {
    "title": "Attachment2",
    "bounds": [100.0, 622.0, 120.0, 642.0],
    "pagenum": 0,
    "location": "..//..//License.rtf",
    "resource": "rsrc_13ffe09daa1dd4a67b2ed8cec6e7a5fcc8c5156b"
  }],
  "resources": {
    "rsrc_13ffe09daa1dd4a67b2ed8cec6e7a5fcc8c5156b": {
      "url": "/v1/resources/rsrc_13ffe09daa1dd4a67b2ed8cec6e7a5fcc8c5156b",
      "mimetype": "application/octet-stream"
    },
    "rsrc_5b9477c374e62a348f4423956684b028fe9acb95": {
      "url": "/v1/resources/rsrc_5b9477c374e62a348f4423956684b028fe9acb95",
      "mimetype": "application/octet-stream"
    }
  }
}

Attachment resources

PDF document attachments can be anything, just like email attachments.

Just like emails, PDF documents can contain attachments, separate files that are not displayed as part of the PDF document. The attachments operation extracts all attachments found within a PDF document and provides them as resources, along with whatever metadata is available for each attachment.

attachments operation configuration object attributes
op

Must be "attachments".

In addition to the raw attachment data linked to in the resources object in each result, the data attribute's array contains an object providing metadata for each attachment.

Attachment object attributes
description (optional)

A description of the attachment or its contents, for human consumption.

location (optional)

Typically a filesystem path or URL, indicating from where the attachment was sourced.

resource (optional)

The ID of the resource associated with this object. Further data about the resource — including a url by which the resource's binary data can be retrieved — can be found within the same result object's resources object, keyed by this ID, e.g. result.resources["rsrc_5b9477c374e62a348f4423956684b028fe9acb95"].

title (optional)

A further description of the attachment or its contents. Only present if an attachment is pinned to a particular location within the source document.

pagenum (optional)

The source document's page number, zero-indexed. Only present if an attachment is pinned to a particular location within the source document.

bounds (optional)

A description of a rectangular bounding box's coordinates. Only present if an attachment is pinned to a particular location within the source document.

PDF document attachments can optionally be "pinned" to a location within the source document, usually represented as a "push-pin" annotation on a particular page. In this case, an attachment's data object will include pagenum, bounds, and title attributes that describe the location and presentation of their associated annotation.

Note that PDF document attachments do not explicitly indicate their content type or mimetype, so all attachment resources will have a mimetype of "application/octet-stream". The location attribute of each attachment's data object will usually include a filename (sometimes along with other path information from when the file was attached to the PDF) that will include a file extension (like .bmp or .rtf as in the example result to the right) that you can use to infer the attachment's file or content type.
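The extension-based inference described above might look like the following sketch. Everything here is an assumption layered on top of the API: inferAttachmentType is a hypothetical helper, and TYPES_BY_EXTENSION is a deliberately tiny lookup table that a real application would replace with a fuller mapping.

```javascript
var path = require("path");

// Hypothetical, deliberately small extension-to-type table
var TYPES_BY_EXTENSION = {
  ".bmp": "image/bmp",
  ".rtf": "application/rtf",
  ".txt": "text/plain",
  ".pdf": "application/pdf"
};

// Hypothetical helper: guess a content type from an attachment object's
// location, falling back to the generic type the API itself reports.
function inferAttachmentType(attachment) {
  if (!attachment.location) {
    return "application/octet-stream";
  }
  // Locations may carry path information with either slash style,
  // so keep only the final path segment before reading the extension.
  var filename = attachment.location.split(/[\\/]/).pop();
  var extension = path.extname(filename).toLowerCase();
  return TYPES_BY_EXTENSION[extension] || "application/octet-stream";
}

console.log(inferAttachmentType({ location: "..//..//two_pilots.bmp" }));
console.log(inferAttachmentType({ location: "..//..//License.rtf" }));
```

This sketch does not handle locations that are URLs with query strings; treat its answer as a hint, not a guarantee of the attachment's actual format.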

It is possible (though very rare) for a PDF document attachment to not actually carry attachment data. In this case, the attachment location will typically be a URL referring to an "attachment" located elsewhere, and the attachment object will lack a resource attribute.
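Since the resource attribute is optional, code that downloads attachments should separate those that carry data from those that only point elsewhere. A minimal sketch, with partitionAttachments as a hypothetical helper and sample data that is illustrative (the second entry and its URL are invented):

```javascript
// Hypothetical helper: split an attachments result into attachments whose
// data was extracted (embedded) and those that only reference an external
// location (external).
function partitionAttachments(result) {
  var resources = result.resources || {};
  var embedded = [];
  var external = [];
  (result.data || []).forEach(function (attachment) {
    var resource = attachment.resource && resources[attachment.resource];
    if (resource) {
      embedded.push({ attachment: attachment, url: resource.url });
    } else {
      external.push(attachment);
    }
  });
  return { embedded: embedded, external: external };
}

// Illustrative result: the second attachment and its URL are invented.
var sampleResult = {
  op: "attachments",
  data: [
    { location: "..//..//License.rtf",
      resource: "rsrc_13ffe09daa1dd4a67b2ed8cec6e7a5fcc8c5156b" },
    { location: "http://example.com/elsewhere.doc" }
  ],
  resources: {
    rsrc_13ffe09daa1dd4a67b2ed8cec6e7a5fcc8c5156b: {
      url: "/v1/resources/rsrc_13ffe09daa1dd4a67b2ed8cec6e7a5fcc8c5156b",
      mimetype: "application/octet-stream"
    }
  }
};

var parts = partitionAttachments(sampleResult);
console.log(parts.embedded.length + " embedded, " + parts.external.length + " external");
```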

If no attachments are found in a source document, then the resources and data attributes of the operation result will be empty.

Resources

Operation result object linking to extracted resource

{
  "op": "xmp-metadata",
  "resources": {
    "rsrc_4218185ed0b3736f47ec787bb4142189cbcb057b": {
      "url": "/v1/resources/rsrc_4218185ed0b3736f47ec787bb4142189cbcb057b",
      "mimetype": "application/xml"
    }
  }
}

Operation result object including references to resources from data entities

{
  "op": "images",
  "data": [{
    "type": "page",
    "dimensions": [595, 839],
    "pagenum": 0,
    "images": [{
      "type": "img",
      "bounds": [62.362, 248.541, 72.362, 255.541],
      "resource": "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a"
    }, {
      "type": "img",
      "bounds": [62.362, 173.397, 63.362, 174.397],
      "resource": "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72"
    }]
  }],
  "resources": {
    "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a": {
      "url": "/v1/resources/rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a",
      "format": "png",
      "mimetype": "image/png",
      "dimensions": [10.0, 7.0]
    },
    "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72": {
      "url": "/v1/resources/rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72",
      "format": "png",
      "mimetype": "image/png",
      "dimensions": [1.0, 1.0]
    }
  }
}

Some operations extract data which cannot be represented as a structured JSON value, e.g. bitmap images, attachments, and so on. Within the PDFDATA.io API, these sorts of data entities are called resources. If an operation produces resources, the results it delivers via completed proc objects will include a resources attribute, a JSON object that links to and provides metadata for all of the resources extracted by the operation. Each resource's binary data is then obtained via a separate API request to the provided URL.

Base resource object attributes
url

The (relative) URL where this resource's data may be retrieved via an HTTP GET request.

mimetype

The MIME type of this resource. Ideally it gives a useful and accurate indication of the format of the resource's data, but depending on the operation and on how the source PDF document encodes the resource, it may be the generic application/octet-stream.

Operations may "enrich" the base set of resource object attributes of url and mimetype with additional metadata that pertains strictly to the resources' data (again, as in the case of images results).

Some operations' results will only include resources (like xmp-metadata); others will include further information about the use of those resources in the source PDF documents via the data attribute (like images). In these cases, objects referring to resources will do so by providing the resource's ID via a resource attribute, which can then be used to perform a lookup on the resources object from the same operation result.
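The lookup described above can be sketched as follows for an images result. resolveImageResources is a hypothetical helper, not a pdfdata-node function, and the sample object is a trimmed copy of the example images result.

```javascript
// Hypothetical helper: pair every image entity in an images result with
// the resource object its "resource" ID refers to.
function resolveImageResources(result) {
  var resolved = [];
  (result.data || []).forEach(function (page) {
    (page.images || []).forEach(function (image) {
      resolved.push({
        pagenum: page.pagenum,
        bounds: image.bounds,
        resource: result.resources[image.resource]
      });
    });
  });
  return resolved;
}

// Trimmed copy of the example images result
var imagesResult = {
  op: "images",
  data: [{
    type: "page",
    pagenum: 0,
    images: [
      { type: "img", bounds: [62.362, 248.541, 72.362, 255.541],
        resource: "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a" },
      { type: "img", bounds: [62.362, 173.397, 63.362, 174.397],
        resource: "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72" }
    ]
  }],
  resources: {
    rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a: {
      url: "/v1/resources/rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a",
      format: "png", mimetype: "image/png", dimensions: [10.0, 7.0]
    },
    rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72: {
      url: "/v1/resources/rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72",
      format: "png", mimetype: "image/png", dimensions: [1.0, 1.0]
    }
  }
};

resolveImageResources(imagesResult).forEach(function (entry) {
  console.log("page " + entry.pagenum + ": " + entry.resource.url);
});
```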

Retrieving resources

Definition

GET https://api.pdfdata.io/{RESOURCE_URL}
pdfdata.resources.byID({RESOURCE_ID});

pdfdata.resources.byURL({RESOURCE_URL});

Example request

curl https://api.pdfdata.io/v1/resources/rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72 \
  -u test_xkH4xlrO7K80J5CTwEjBeSo6:
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.resources.byURL("/v1/resources/rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72")
    .then(function (resourceMessage) {
        // handle http.IncomingMessage
    })
    .catch(function (error) {
        // handle failure
    });

pdfdata.resources.byID("rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72")
    .then(function (resourceMessage) {
        // handle http.IncomingMessage
    })
    .catch(function (error) {
        // handle failure
    });

Example response

< HTTP/1.1 200 OK
< Content-Type: application/xml
< Content-Disposition: attachment; filename="8e9600cd7db5baf2fad83e4d8b48359678b24322-xmp.xml"
< content-length: 3468
< 

... resource's data follows ...

When you've identified the resource you'd like to retrieve, send an HTTP GET request to the url indicated in that resource's object. Those URLs are relative, meaning that they can be easily resolved against e.g. the URL you used to create and/or retrieve the completed proc object that included the extracted data and resources, or against the PDFDATA.io API root directly (https://api.pdfdata.io).
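If you are issuing requests with your own HTTP client from Node.js, resolving a relative resource URL can be done with the standard WHATWG URL class, as in this minimal sketch. absoluteResourceURL is a hypothetical helper, and the proc URL used as an alternative base is an invented placeholder, not a documented endpoint shape.

```javascript
var API_ROOT = "https://api.pdfdata.io";

// Hypothetical helper: resolve a resource's relative url against a base,
// defaulting to the API root.
function absoluteResourceURL(relativeURL, baseURL) {
  return new URL(relativeURL, baseURL || API_ROOT).href;
}

var relative = "/v1/resources/rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72";

// Resolved against the API root
console.log(absoluteResourceURL(relative));

// Resolved against the URL of a previous request (placeholder value);
// both calls yield the same absolute URL because the path starts with "/".
console.log(absoluteResourceURL(relative, "https://api.pdfdata.io/v1/procs/proc_example"));
```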

When you've identified the resource you'd like to retrieve, call either pdfdata.resources.byID() or pdfdata.resources.byURL() with the resource's ID or url, respectively. The result will be a promise that will be fulfilled with an http.IncomingMessage object. Its attributes will be unmodified from those corresponding to the HTTP response issued by the PDFDATA.io API; its body attribute will be a Buffer containing the resource's data.
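Once the promise is fulfilled, the Buffer held in the message's body attribute can be written straight to disk. The sketch below shows one way to do that; saveResource is a hypothetical helper, and fakeMessage is a stand-in object used so the example runs without network access. In real use you would pass the message delivered by pdfdata.resources.byID() or pdfdata.resources.byURL().

```javascript
var fs = require("fs");
var os = require("os");
var path = require("path");

// Hypothetical helper: write a retrieved resource's data to targetPath
// and return the number of bytes written.
function saveResource(resourceMessage, targetPath) {
  fs.writeFileSync(targetPath, resourceMessage.body);
  return resourceMessage.body.length;
}

// Stand-in for the message a resource promise is fulfilled with;
// only the documented body attribute (a Buffer) is relied upon.
var fakeMessage = { statusCode: 200, body: Buffer.from("<x:xmpmeta/>") };
var target = path.join(os.tmpdir(), "pdfdata-example-xmp.xml");

console.log(saveResource(fakeMessage, target) + " bytes written to " + target);
```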

A couple of headers included in resource responses are notable: