Introduction
PDFDATA.io is an HTTP API that provides PDF data extraction as a service. We make it easy for you to access the data you care about that is held within PDF documents, in a way that can be readily integrated into your applications and services. More concretely, working with PDFDATA.io consists of:
- Providing access to your PDF source documents
- Describing which data from those documents should be extracted
- Retrieving the extracted data
Doing this is accomplished by interacting with the PDFDATA.io HTTP API, which incorporates many of the best principles of REST design:
- resource-oriented URLs and standard HTTP verbs
- using HTTP response codes to communicate success and error conditions
- using common HTTP authentication mechanisms
- returning JSON in all API responses (except for those related to extracted binary data)
The examples in this documentation are shown in:
- curl, the command-line HTTP client
- JavaScript for Node.js, using pdfdata-node, our Node.js client library
- Java, using pdfdata-java, our client library for Java / JVM applications
While we provide "native" client libraries for multiple programming languages, this design makes it straightforward to interact with the API using any language's or tool's modern HTTP client. If we do not (yet!) have a client library available for your preferred programming language, you can follow the curl examples and samples in this reference; analogous usage can be readily constructed using similar tools, or in any language using its HTTP client library.
Setup
curl is probably available via your operating system's package manager, or by direct download.
While you can use curl itself to interact with PDFDATA.io, these examples are provided in part because curl invocations are widely understood as representations of HTTP requests. If we do not yet provide a client library for your programming language, you should be able to use these example curl interactions to guide your integration with the PDFDATA.io API using your language's HTTP client implementation.
Install via npm:
npm install pdfdata
Source for pdfdata-node is available via GitHub.
pdfdata-node is our Node.js client for PDFDATA.io. It is an idiomatic, promise-based JavaScript library that any Node developer can have up and running in a minute or two. Add it to your project's package.json file, or run npm install pdfdata to start working with it.
pdfdata-java requires JDK 1.8 or greater.
Javadoc for pdfdata-java is available for all released versions.
pdfdata-java is available in Maven Central. You can thus easily add it to your project by adding its Maven coordinates to the dependencies section of your build tool's configuration file:
Maven
<dependency>
  <groupId>io.pdfdata</groupId>
  <artifactId>pdfdata-java</artifactId>
  <version>0.9.9</version>
</dependency>
sbt
libraryDependencies += "io.pdfdata" % "pdfdata-java" % "0.9.9"
Gradle
compile 'io.pdfdata:pdfdata-java:0.9.9'
Leiningen
[io.pdfdata/pdfdata-java "0.9.9"]
pdfdata-java is our Java / JVM client for PDFDATA.io. It is a lightweight API client library with minimal additional dependencies that any Java or other JVM language developer can have up and running in a minute or two. Add it to your project's dependencies to start working with it.
Authentication & Credentials
Throughout this documentation, sample code will use a dummy test API key:
test_xkH4xlrO7K80J5CTwEjBeSo6
If you login and then reload this page, your test API key will be shown in all code samples instead.
Providing your API credentials
curl https://api.pdfdata.io/v1/ \
-u test_xkH4xlrO7K80J5CTwEjBeSo6:
Requests to the PDFDATA.io API must carry credentials, and must be made via HTTPS; any that do not will result in an error.
API requests are authenticated by HTTP Basic Auth. Your API key is the HTTP Basic username; the password is empty / not provided.
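To make the Basic Auth convention concrete, here is a minimal sketch of building the Authorization header by hand, as you might with an HTTP client that has no built-in Basic Auth support. The function name basicAuthHeader is purely illustrative and is not part of pdfdata-node.

```javascript
// Hypothetical helper (not part of any PDFDATA.io client library).
// HTTP Basic Auth base64-encodes "username:password". Here the API key is the
// username and the password is empty, so the encoded string is "<key>:".
function basicAuthHeader(apiKey) {
  var encoded = Buffer.from(apiKey + ":", "utf8").toString("base64");
  return "Basic " + encoded;
}

var header = basicAuthHeader("test_xkH4xlrO7K80J5CTwEjBeSo6");
console.log(header.startsWith("Basic ")); // true
```

The trailing colon matters: it is what separates the username from the (empty) password, just as in the `-u test_...:` form of the curl example.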
As the first argument to the function exported by the pdfdata module:
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
Or, you can set the PDFDATA_APIKEY environment variable appropriately for your operating system, e.g.:
export PDFDATA_APIKEY=test_xkH4xlrO7K80J5CTwEjBeSo6
and then omit the extra argument when requiring pdfdata:
var pdfdata = require("pdfdata")();
As a constructor argument to the root io.pdfdata.API class:
API pdfdata = new io.pdfdata.API("test_xkH4xlrO7K80J5CTwEjBeSo6");
Or, you can set the PDFDATA_APIKEY environment variable (or JVM system property) appropriately for your operating system, e.g.:
export PDFDATA_APIKEY=test_xkH4xlrO7K80J5CTwEjBeSo6
or
java -DPDFDATA_APIKEY=test_xkH4xlrO7K80J5CTwEjBeSo6 [...remaining application arguments...]
and then omit the constructor argument:
io.pdfdata.API pdfdata = new io.pdfdata.API();
Authenticating your requests to the PDFDATA.io API requires providing your credentials: an API key associated with your account. You can manage your API keys in the dashboard. While all communications between your application and PDFDATA.io are secured by HTTPS / TLS, API keys enable access to your source documents and extracted data, so be sure to keep them secret! Do not leave your API keys in source control repositories, client-side code, or other widely-accessible areas. We will honor any API request that includes your credentials as being authorized by you, so protect them accordingly.
There are two classes of API keys that can be associated with a PDFDATA.io account, test and live. Which one you use will control what limits will be applied to your usage of the API, and what charges (if any) will result from that usage. Each class of API key is easily identified by its prefix (either test_ or live_).
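Because the key class is visible in the prefix, an application can check which class it was configured with before doing any work. The sketch below is one way to do that; apiKeyClass is a hypothetical helper, and the live_ key used in the usage note is a made-up placeholder.

```javascript
// Hypothetical helper: classify an API key by its documented prefix.
function apiKeyClass(apiKey) {
  if (apiKey.indexOf("test_") === 0) return "test";
  if (apiKey.indexOf("live_") === 0) return "live";
  return "unknown";
}

console.log(apiKeyClass("test_xkH4xlrO7K80J5CTwEjBeSo6")); // "test"
```

A deployment script might, for instance, refuse to start a production service unless apiKeyClass(process.env.PDFDATA_APIKEY) returns "live".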
The pdfdata node module exports a single function that will accept your API key as its sole argument, or source your API key from your program's environment variables in order to produce a usable client object.
Your API key can be provided to the pdfdata-java client library either as an argument to one of the constructors on the primary io.pdfdata.API class, or by setting an environment variable (or JVM system property) appropriately.
Quick Start
Extracting metadata from a single PDF document
curl https://api.pdfdata.io/v1/procs \
-u test_xkH4xlrO7K80J5CTwEjBeSo6: \
-F file=@test.pdf \
-F operations='[{"op":"metadata"}]'
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.procs.configure()
.withFiles(["test.pdf"])
.operation({op:"metadata"})
.start()
.then(console.log)
API pdfdata = new io.pdfdata.API("test_xkH4xlrO7K80J5CTwEjBeSo6");
Proc proc = pdfdata.procs().configure()
.withFiles("test.pdf")
.withOperations(new Metadata())
.start();
Example extracted metadata response (JSON; an io.pdfdata.model.Proc in pdfdata-java)
{
"type": "proc",
"id": "proc_1555579d8df",
"created": "2016-06-15T19:11:36Z",
"source_tags": [],
"operations": [{
"op": "metadata"
}],
"status": "complete",
"documents": [{
"type": "doc",
"id": "doc_8e96ec0533ac3e1e988b7d1ca27bfdc096b82ddc",
"filename": "test.pdf",
"tags": [],
"created": "2016-06-15T19:11:36Z",
"expires": "2016-07-15T19:15:32Z",
"results": [{
"op": "metadata",
"data": {
"Creator": "SPDF",
"Title": "Microarray Gene Expression Data with Linked Survival Phenotypes",
"Producer": "AppendPro 3.0 Linux 7 SPDF_1085 May 15 2003",
"Subject": "Center for Bioinformatics & Molecular Biostatistics",
"ModDate": "2006-01-24T20:38:13Z",
"Keywords": "Diffuse large-B-cell lymphoma; Gene harvesting; Least angle regression",
"CreationDate": "2006-01-24T20:38:13Z",
"Author": "Mark R. Segal"
}
}]
}]
}
Accessing common metadata entries
Metadata.Result md = ((Metadata.Result) proc.getDocuments().get(0).getResults().get(0));
md.getTitle(); // "Microarray Gene Expression Data with Linked Survival Phenotypes"
md.getCreationDate(); // "2006-01-24T20:38:13Z" (a java.time.Instant object)
// ...etc...
Once your environment is set up, you can get results from PDFDATA.io very quickly and with just a couple lines of code.
Nearly every PDF document contains metadata, and PDFDATA.io provides an operation that extracts it. Applying that operation to a sample PDF document you have available is the "Hello World" of PDFDATA.io integration and usage.
The code snippet shown here does this in a single request by:
- uploading a source document named test.pdf located in the current directory
- configuring a single data extraction operation
- creating a proc, which will apply that operation to the source document
Because the metadata extraction operation is very lightweight and requires relatively little processing, a proc consisting only of applying it to a single document will almost always complete before the default timeout for the request's response expires, so that response will almost always contain the completed extraction results.
Once you've completed this "Hello World" sort of task, you'll want to learn about which operations you'll need to get the data you want from your source PDFs, how to work with documents separately from procs, and how procs are long-running processes, which carries significant implications for how you'll interact and integrate with the PDFDATA.io API in a production environment.
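Once a completed proc response is in hand, the extracted metadata sits under documents, then results, then data. The following is a minimal sketch of pulling it out of the parsed JSON; metadataByFilename is a hypothetical helper, and sampleProc is an abbreviated copy of the example response shown earlier in this section.

```javascript
// Sketch: collect the "metadata" results from a proc response, keyed by filename.
// `proc` is assumed to be the parsed JSON body of a proc response.
function metadataByFilename(proc) {
  var out = {};
  (proc.documents || []).forEach(function (doc) {
    (doc.results || []).forEach(function (result) {
      if (result.op === "metadata") out[doc.filename] = result.data;
    });
  });
  return out;
}

// Abbreviated version of the example response from this section.
var sampleProc = {
  status: "complete",
  documents: [{
    filename: "test.pdf",
    results: [{
      op: "metadata",
      data: {
        Title: "Microarray Gene Expression Data with Linked Survival Phenotypes",
        Author: "Mark R. Segal"
      }
    }]
  }]
};

console.log(metadataByFilename(sampleProc)["test.pdf"].Author); // "Mark R. Segal"
```

Matching on the op attribute rather than on array position keeps the lookup working when a proc includes more than one operation.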
Security and Privacy
The PDFDATA.io API is built to provide exceptional PDF data extraction quality, including features not readily available elsewhere, packaged so as to be easily usable in any modern programming language and application context. We can do this only because the API is primarily a managed, hosted service provided by us.
This means that, when you use PDFDATA.io, your source PDF documents and the data it extracts from them are necessarily held by us for a time. We take this responsibility very seriously, and thus apply the same set of security standards, engineering practices, and policies to every PDFDATA.io user, regardless of service level:
- Your data is yours. It is never sold to any other organization, and is never shared with outside organizations, except in the case of computing infrastructure vendors we use to help us provide and operate PDFDATA.io.
- Data you provide to us (either uploaded PDF documents, or the data we extract from them for you) is always transmitted securely (encrypted via HTTPS).
- Further, all data you provide to us automatically expires, i.e. is purged from our systems. This expiration defaults to 30 days for uploaded PDF documents, and 90 days for extracted data.
Usage limits
Your usage of the API is governed by whether or not you have activated a PDFDATA.io service plan, and which class of API key you use to authenticate requests to the API.
PDFDATA.io API usage is moderated only in terms of the number of PDF documents or PDF document pages from which you extract data within a given month. Different data extraction operations have different billing methods, based on their relative complexity and value. Please consult the descriptions of available plans for details, or your dashboard to see exactly what your activated service plan's pricing is, and what limits your account's API keys have. What follows here is general guidance only.
Test and Live API keys
When you register (for free) on PDFDATA.io, you will only have access to a test API key in your account at first, with very stringent usage limits. The intent of this is to allow new visitors to PDFDATA.io to familiarize themselves with how the service and API work.
After you activate a PDFDATA.io service plan, your account is provisioned a live API key (with pricing and limits set according to your chosen plan). In addition, your test API key's limits are extended to roughly 5% of your plan's full allotment of PDF extraction ops per month.
For example, if your chosen plan includes 5,000 ops / month, usage authenticated by your live API key will draw down that allotment, and additional charges may occur if usage exceeds that allotment within a service plan month. Meanwhile, usage authenticated by your test API key will be limited to ~250 ops / month; when this limit is reached, additional new proc requests will be refused with usage_limit_reached error responses. The same will happen if usage limits are reached prior to service plan activation, or if your service plan is allowed to lapse (for example, if your chosen billing/payment method fails).
The intent of this is to allow you and your team to have a way to go about development and testing of your applications and systems with a dedicated PDFDATA.io API key that doesn't impact your service plan's production allotment or produce additional billing in the case of erroneous test & development activity.
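The "roughly 5%" figure is simple arithmetic, sketched below for estimating a test key's monthly headroom from a plan's allotment. The function name and the choice to round down are assumptions; your dashboard remains the source of truth for actual limits.

```javascript
// Sketch of the "roughly 5%" rule: 5% is 1/20 of the plan allotment.
// Rounding down is an assumption; the exact rounding is not documented here.
function approxTestKeyLimit(planOpsPerMonth) {
  return Math.floor(planOpsPerMonth / 20);
}

console.log(approxTestKeyLimit(5000)); // 250
```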
Errors
PDFDATA.io's HTTP API uses standard HTTP status codes and explanatory JSON bodies to indicate API request failures. All API-level errors can be broadly separated into problems with the request (these will provoke responses with 4xx status codes), or unexpected problems on our part in processing the request (resulting in a 5xx status code).
Example error response (JSON; surfaced as an io.pdfdata.APIException in pdfdata-java)
{
"error": {
"type": "malformed_parameter",
"message": "One of the operation specifications you provided is not valid. See our documentation and get help at https://www.pdfdata.io/help",
"invalid_operations": [{
"op": "page-templates"
}],
"request": {
"method": "POST",
"uri": "/v1/procs",
"headers": {
"via": "1.1 vegur",
"host": "api.pdfdata.io",
"user-agent": "curl/7.47.0",
"content-type": "multipart/form-data; boundary=------------------------dc4bbdda50dceee3",
"content-length": "45192",
"connect-time": "1",
"total-route-time": "0",
"x-request-start": "1498653967467",
"connection": "close",
"accept": "*/*",
"authorization": "Basic xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
},
"params": {
"file": {
"filename": "edf37c1d61c99583f1e0be5a0999a6ac9b2d75bd.pdf",
"content_type": "application/octet-stream",
"size": 44814
},
"operations": "[{\"op\":\"metadata\"},{\"op\":\"page-templates\"}]"
}
}
}
}
When an error occurs when interacting with the PDFDATA.io API, pdfdata-java will throw an io.pdfdata.APIException that provides access to the "raw" JSON error response, as well as access to the data that produced the failed API request (including request headers, parameters, and so on).
PDFDATA.io error responses are JSON documents that describe the nature of the error as much as possible, and echo back the contents of the offending request. The full error response body shown above demonstrates the basic form: an error object, which contains the attributes listed here.
Common Error object attributes
Attribute | Description
---|---
request | An object representing the contents of the request that prompted this error response. It will always include method, uri, headers, and parameters attributes (the last appears as params in the example response), corresponding to the essential components of HTTP requests.
type | A string indicating the class of error, with more specificity than is possible with standard HTTP response status codes. The various error types produced by the PDFDATA.io API are enumerated and described in the Error types table below.
message | A friendly description of the problem that led to the error being raised.
Many types of errors include additional attributes to more fully describe why the error occurred and aid in debugging.
Error types | Indication
---|---
missing_parameter | A necessary request parameter was not included in the request.
invalid_parameter | One of the request parameters, while properly formatted or encoded, is incorrect; for example, if a source document id is provided when starting a new proc that corresponds with no known source document.
malformed_parameter | One of the request parameters was not formatted or encoded properly.
usage_limit_reached | A request to start a new proc was rejected because the new proc would exceed the monthly usage limit associated with the provided API credentials. See Usage limits for details on when and why a usage limit will prompt an error response.
document_expired | A request to retrieve the contents of a source document could not be satisfied, because the document expired out of PDFDATA.io's storage. See Document lifecycle for details on source document expiration.
resource_expired | A request to retrieve the contents of an extracted resource could not be satisfied, because the resource expired out of PDFDATA.io's storage.
document_not_found | A requested source document could not be located. Check the validity of the source document id.
resource_not_found | A requested extracted resource could not be located. Check the validity of the resource id.
unknown_api_uri | The request was for an invalid API URI.
apidoc_fake_api_key | The API credential provided was the "dummy" test API key shown above when this page is loaded by anonymous visitors. Use your real API credentials (available on your dashboard, or throughout this reference's code samples if you load the apidoc while logged in) to resolve this "error".
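When calling the API with a plain HTTP client, the status code and the error object together tell you where a failure came from. The sketch below condenses them into a small summary; summarizeError and the shape of its return value are illustrative assumptions, and the 400 status used in the example is assumed rather than taken from this reference.

```javascript
// Sketch: summarize a failed API call from its HTTP status and parsed JSON body.
// The 4xx/5xx split and the error.type / error.message attributes come from
// this reference; the returned object's shape is an illustrative assumption.
function summarizeError(status, body) {
  var error = (body && body.error) || {};
  var source = "none";
  if (status >= 500) source = "server";        // unexpected problem while processing
  else if (status >= 400) source = "request";  // problem with the request itself
  return {
    source: source,
    type: error.type || null,
    message: error.message || null,
    usageLimitReached: error.type === "usage_limit_reached"
  };
}

var summary = summarizeError(400, {
  error: {
    type: "malformed_parameter",
    message: "One of the operation specifications you provided is not valid."
  }
});
console.log(summary.source, summary.type); // request malformed_parameter
```

Branching on type rather than on message text is the safer choice, since messages are written for people and types are the enumerated values in the table above.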
Core Concepts & Types
As outlined in the introduction, working with PDFDATA.io consists of:
- Providing access to your PDF source documents
- Describing which data from those documents should be extracted via declarative representations of the corresponding operations
- Requesting that PDFDATA.io apply those operations to some set of your documents, forming a proc
- Retrieving the data extracted by the proc; depending on the selected operations, that data might be structured JSON, while other data might be binary resources
These four terms (documents, operations, procs, and resources) are the four fundamental concepts represented in PDFDATA.io's API, as well as the four primary "types" of objects you'll encounter when working with the API; an in-depth discussion of them dominates the remainder of this guide. There are a couple of additional data shapes worth discussing here:
Timestamps
The only atomic values understood and produced by PDFDATA.io aside from those defined by JSON itself are timestamps. Timestamps are encoded in an ISO 8601 string format that corresponds to the standard JavaScript Date format:
YYYY-MM-DDTHH:mm:ssZ
Timestamps will always include all components of this string formatting (e.g. timestamps will never omit the time or seconds components), and will always be in UTC.
When a timestamp is required by the API, the pdfdata-java library will accept/provide a java.time.Instant object.
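For callers producing or consuming these timestamps in JavaScript, a minimal sketch follows. Note that Date.prototype.toISOString() includes milliseconds, so they are dropped here to match the second-precision form above; toTimestamp and fromTimestamp are hypothetical helper names.

```javascript
// Date.prototype.toISOString() yields e.g. "2016-06-11T18:23:33.000Z";
// dropping the milliseconds gives the YYYY-MM-DDTHH:mm:ssZ form shown above.
function toTimestamp(date) {
  return date.toISOString().replace(/\.\d{3}Z$/, "Z");
}

// Strings in this format parse directly with the Date constructor (as UTC).
function fromTimestamp(timestamp) {
  return new Date(timestamp);
}

// Months are zero-based in Date.UTC, so 5 is June.
console.log(toTimestamp(new Date(Date.UTC(2016, 5, 11, 18, 23, 33)))); // "2016-06-11T18:23:33Z"
```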
Documents
PDF documents are the raw material from which PDFDATA.io extracts data. You need to get your PDF documents into PDFDATA.io in order for it to be able to read them; this is done by uploading your documents via the API, which creates document objects that carry some limited metadata about them. As a shortcut, you can also upload documents, create a proc, and receive the extracted data all in a single request.
Once PDFDATA.io has access to your source documents, you can refer to them using their unique identifiers or associated tags when creating data extraction procs.
Example Document object
{
  "type": "doc",
  "id": "doc_8e96ec0533ac3e1e988b7d1ca27bfdc096b82ddc",
  "filename": "annual_report.pdf",
  "tags": ["acquired:2019-12-06"],
  "created": "2019-12-06T08:49:42Z",
  "expires": "2020-01-05T08:49:42Z",
  "pagecount": 12
}
Document object attributes | ||
---|---|---|
type | ||
id | Once in PDFDATA.io, each document is assigned a unique identifier, a cryptographic hash of its contents prefixed with the object's | |
filename | The filename for this document within PDFDATA.io. This will be the same as the filename of the document from when you uploaded it, or when PDFDATA.io obtained it from an integrated storage service. | |
tags | A collection of string tags associated with this document. These tags can be added when documents are provided to PDFDATA.io, and used later to start procs over groups of documents that share a given tag. | |
created | The time when PDFDATA.io first acquired this document. | |
expires | The time when the document itself will expire out of storage. This is calculated to be 30 days from when this document was most recently provided to PDFDATA.io. After that point, the document will not be available for inclusion in new procs until it is provided again. | |
pagecount | The number of pages in this document, determined once when PDFDATA.io first obtains the document. A value of This may be relevant for PDFDATA.io usage metering and billing depending on your service plan. |
All operations over documents are mediated through the /v1/documents resource and its descendants.
In pdfdata-node, they are mediated through the documents object provided by the pdfdata module.
In pdfdata-java, they are mediated through the DocumentsRequest object provided by io.pdfdata.API instances.
Document lifecycle
Every document carries an expiration date, which is initially set to 30 days from when it is uploaded. This expiration window is reset each time a proc is started that sources data from a document; in this way, documents that are being actively touched by proc activity at least once every 30 days will remain in the system in perpetuity.
When a document expires, the PDF itself is purged from PDFDATA.io's systems, but the document record itself remains for reporting and billing purposes.
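The 30-day window can be reasoned about locally. The sketch below computes the expiration implied by the most recent upload or proc start, and checks a document object's expires attribute against a given time; both helper names are hypothetical, and the service's own expires value should be treated as authoritative.

```javascript
var THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Sketch: the expiration implied by the 30-day window, counted from the most
// recent upload or proc start that touched the document.
function expiresAfter(lastTouched) {
  return new Date(lastTouched.getTime() + THIRTY_DAYS_MS);
}

// True once the document object's `expires` timestamp has passed.
function isExpired(doc, now) {
  return new Date(doc.expires).getTime() <= now.getTime();
}

console.log(expiresAfter(new Date("2016-06-11T18:23:33Z")).toISOString());
// "2016-07-11T18:23:33.000Z"
```

A scheduled job could use isExpired to decide which source PDFs need to be uploaded again before starting new procs over them.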
Uploading documents
Definition
POST https://api.pdfdata.io/v1/documents
pdfdata.documents.upload({ARRAY_OF_PDF_PATHS}, [{ARRAY_OF_TAGS}]);
pdfdata.documents().upload(java.io.File... files)
pdfdata.documents().upload(Collection<File> files)
pdfdata.documents().upload(Collection<File> files, Collection<String> tags)
Example request
curl https://api.pdfdata.io/v1/documents \
-u test_xkH4xlrO7K80J5CTwEjBeSo6: \
-F file=@{PATH_TO_PDF} \
-F file=@{PATH_TO_PDF2} \
-F tag={TAG_STRING}
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.documents.upload(["{PATH_TO_PDF}", "{PATH_TO_PDF2}"],
["{TAG_STRING}"])
.then(function (response) {
// handle response
})
.catch(function (error) {
// handle failure
});
API pdfdata = new API("test_xkH4xlrO7K80J5CTwEjBeSo6");
List<io.pdfdata.model.Document> docs =
pdfdata.documents()
.upload(new File("path/to/file.pdf"));
Documents are uploaded by sending a multipart/form-data POST request, which can contain one or many documents along with their filenames.
Request parameters (HTTP)
Parameter | Description
---|---
file | Each file parameter's value should consist of the raw binary content of a source PDF document. The filename document attribute is sourced from the filename property of each file request parameter. Any provided type attribute is ignored; every range of file data is assumed to be a PDF document.
tag (optional) | Each tag parameter is added to the uploaded documents' set of tags. Tags can be used to easily refer to groups of source PDF documents when creating new procs. Tags may not contain whitespace, or be the empty string.
Parameters (pdfdata-node)
Parameter | Description
---|---
documents | An Array of string paths to the source PDF documents to upload. The filename document attribute is sourced from the filename of the named local file. The content of each file is not checked at upload-time; every file is assumed to be a PDF document.
tags (optional) | An Array of string tags, each added to the uploaded documents' set of tags. Tags can be used to easily refer to groups of source PDF documents when creating new procs. Tags may not contain whitespace, or be the empty string.
Parameters (pdfdata-java)
Parameter | Description
---|---
documents | A Collection of java.io.File objects, the source PDF documents to upload. The filename document attribute is sourced from the filename of the named local file. The content of each file is not checked at upload-time; every file is assumed to be a PDF document.
tags (optional) | A Collection of string tags, each added to the uploaded documents' set of tags. Tags can be used to easily refer to groups of source PDF documents when creating new procs. Tags may not contain whitespace, or be the empty string.
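The two tag rules stated above (no whitespace, not the empty string) are easy to check before sending an upload. This is a minimal client-side sketch; isValidTag is a hypothetical helper, and the API may apply further rules that are not described here.

```javascript
// Sketch: client-side check of the documented tag rules
// (a non-empty string containing no whitespace).
function isValidTag(tag) {
  return typeof tag === "string" && tag.length > 0 && !/\s/.test(tag);
}

console.log(isValidTag("acquired:2016-06-11")); // true
console.log(isValidTag("january resumes"));     // false
```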
The document upload response
Example response (JSON; a java.util.List<io.pdfdata.model.Document> in pdfdata-java)
[{
"type": "doc",
"filename": "document.pdf",
"created": "2016-06-11T18:23:33Z",
"expires": "2016-07-11T18:23:33Z",
"tags": ["acquired:2016-06-11"],
"id": "doc_8e96ec0033ac3e1e988b7d1ca27bfdc096b82ddc"
}, {
"type": "doc",
"filename": "document2.pdf",
"created": "2016-06-11T18:23:33Z",
"expires": "2016-07-11T18:23:33Z",
"tags": ["acquired:2016-06-11"],
"id": "doc_a5d8e5d1b99ac891226acb35f24a9f8f8eda50df"
}]
When successful, document uploads will yield a 201 Created HTTP response, the body of which will be a JSON-encoded array of the document objects corresponding to the uploaded PDF files.
With pdfdata-node, successful document uploads yield a response consisting of an Array of document objects corresponding to the uploaded PDF files.
With pdfdata-java, successful document uploads yield a response consisting of a Collection of io.pdfdata.model.Document objects corresponding to the uploaded PDF files.
Document objects will always be given at least one tag indicating the date of their initial creation (i.e. the first time the corresponding PDF has been seen by PDFDATA.io, like "acquired:2016-06-11"), in addition to any tags explicitly specified in the upload request.
If a document has been provided to PDFDATA.io previously, then uploading it again will do up to three things:
- The corresponding document object's expires attribute will be "reset" to 30 days from the time of the most recent upload.
- At least one tag will be added to the document object, as described above.
- If the document had previously expired and the source PDF file had therefore been removed from storage, uploading again restores it and makes it available for inclusion in new procs.
Retrieving documents
Definition
GET https://api.pdfdata.io/v1/documents/{DOCUMENT_ID}
pdfdata.documents.get("{DOCUMENT_ID}");
pdfdata.documents().byID("{DOCUMENT_ID}")
Example request
curl https://api.pdfdata.io/v1/documents/{DOCUMENT_ID} \
-u test_xkH4xlrO7K80J5CTwEjBeSo6:
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.documents.get("{DOCUMENT_ID}")
.then(function (response) {
// handle response
})
.catch(function (error) {
// handle failure
});
API pdfdata = new API("test_xkH4xlrO7K80J5CTwEjBeSo6");
io.pdfdata.model.Document doc =
pdfdata.documents().byID("doc_XXXXXX");
Example response (JSON; an io.pdfdata.model.Document in pdfdata-java)
{
"type": "doc",
"filename": "document.pdf",
"created": "2016-06-11T18:23:33Z",
"expires": "2016-07-11T18:23:33Z",
"tags": ["acquired:2016-06-11"],
"id": "doc_8e96ec0033ac3e1e988b7d1ca27bfdc096b82ddc"
}
A single document object can be retrieved, given its id. (Note well that this returns only the document object, not the previously-uploaded source PDF document.) If the id is valid, the response will be the JSON representation of the single document object (an io.pdfdata.model.Document object in pdfdata-java).
Listing documents
Definition
GET https://api.pdfdata.io/v1/documents/
pdfdata.documents.list([options]);
pdfdata.documents().list()
pdfdata.documents().list(java.time.Instant createdBefore)
Example request
curl https://api.pdfdata.io/v1/documents/ \
-u test_xkH4xlrO7K80J5CTwEjBeSo6:
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.documents.list()
.then(function (response) {
// handle response
})
.catch(function (error) {
// handle failure
});
API pdfdata = new API("test_xkH4xlrO7K80J5CTwEjBeSo6");
List<io.pdfdata.model.Document> doc =
pdfdata.documents().list();
Multiple document objects can be retrieved by "listing" the top-level /v1/documents resource.
In pdfdata-node, multiple document objects can be retrieved by "listing" the documents object.
In pdfdata-java, multiple document objects can be retrieved via the list methods on the DocumentsRequest object provided by io.pdfdata.API instances.
Example response (JSON; a java.util.List<io.pdfdata.model.Document> in pdfdata-java)
[{
"type": "doc",
"filename": "document.pdf",
"created": "2016-06-11T18:23:33Z",
"expires": "2016-07-11T18:23:33Z",
"tags": ["acquired:2016-06-11"],
"id": "doc_8e96ec0033ac3e1e988b7d1ca27bfdc096b82ddc"
}, {...},
{...}]
Document listing responses consist of a JSON-encoded array of document objects (a List of io.pdfdata.model.Document objects in pdfdata-java), in descending order of creation date (i.e. most-recently-acquired documents first). Note that since document objects' created attribute is immutable and is set when a document is first uploaded, a document object's position within the listing will never change, even if its corresponding source PDF document is uploaded multiple times.
Example request
curl -G https://api.pdfdata.io/v1/documents/ \
-d before=2016-06-11T18:23:33Z \
-u test_xkH4xlrO7K80J5CTwEjBeSo6:
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.documents.list({before:"2016-06-11T18:23:33Z"})
.then(function (response) {
// handle response
})
.catch(function (error) {
// handle failure
});
API pdfdata = new API("test_xkH4xlrO7K80J5CTwEjBeSo6");
List<io.pdfdata.model.Document> doc =
pdfdata.documents().list(Instant.ofEpochMilli(1465669413715L));
Example response
[{
"type": "doc",
"filename": "report-1819.pdf",
"created": "2016-06-10T06:13:18Z",
"expires": "2016-07-10T06:13:18Z",
"tags": ["acquired:2016-06-10"],
"id": "doc_4a60c27d4653a359478f410be8c45617c59870db"
}, {...},
{...}]
The number of document objects returned in each listing response is capped. To retrieve the next "page" of documents, another listing request must be made, with a before parameter equal to the created attribute value of the last document in the previous listing response.
Request parameters / Options
Parameter | Description
---|---
before (optional) | An ISO 8601-formatted date/timestamp (a java.time.Instant in pdfdata-java). Valid precisions are a full date + time of day UTC (e.g. 2016-06-11T18:23:33Z), or a simple date (e.g. 2016-07-13, which implies midnight UTC for the time component). The "page" of document objects returned in response will begin with the document bearing the latest created attribute prior (but not equal) to any provided before parameter value. Alternatively, in pdfdata-node, before can be a JavaScript Date object; in this case, the full date + time of day are utilized. When omitted, the page of documents provided in response will begin with the most recently-created document object.
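The paging rule described above can be sketched as a small loop that keeps requesting pages until an empty one comes back. Here listPage is a hypothetical stand-in for a promise-returning listing call (such as pdfdata.documents.list), and fakeListPage is an in-memory substitute used only to make the example runnable; neither the page cap of two nor the document ids are real.

```javascript
// Sketch: walk every page of a capped listing using the `before` parameter.
// `listPage(options)` must return a promise of an array of document objects.
function listAll(listPage) {
  var all = [];
  function next(options) {
    return listPage(options).then(function (page) {
      if (page.length === 0) return all;
      all = all.concat(page);
      // The next page starts strictly before the last document's `created`.
      return next({ before: page[page.length - 1].created });
    });
  }
  return next({});
}

// In-memory stand-in for the API, capped at two documents per page.
var docs = [
  { id: "doc_c", created: "2016-06-12T00:00:00Z" },
  { id: "doc_b", created: "2016-06-11T00:00:00Z" },
  { id: "doc_a", created: "2016-06-10T00:00:00Z" }
];
function fakeListPage(options) {
  var older = docs.filter(function (d) {
    return !options.before || d.created < options.before;
  });
  return Promise.resolve(older.slice(0, 2));
}

listAll(fakeListPage).then(function (result) {
  console.log(result.map(function (d) { return d.id; }).join(",")); // "doc_c,doc_b,doc_a"
});
```

One caveat with this approach: because before excludes documents whose created value equals it, documents sharing an identical created timestamp across a page boundary may need extra care. This reference does not say how such ties are ordered.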
Procs
A proc (short for "process") is the combination of a set of source PDF documents and a set of descriptions of data extraction operations to be applied to those documents. Once a proc is complete, the results are available as some combination of structured JSON data and binary resources, the structure of which depends on which operations were selected to be part of the proc.
PDFDATA.io offers many operations, and more are being added all the time. Detailed documentation on each of the available operations can be found here; this discussion of procs and the documentation of the proc-related portions of the API will make use of just two operations:
- metadata extracts document-level metadata present in nearly every PDF document, which is readily represented as a JSON object
- images extracts bitmap images embedded in source PDF documents; the result is a mix of binary resources (the actual PNG or JPEG or TIFF bitmap) and structured data (JSON describing where images are located on each page, the format and size of each image, etc).
Procs are processes
Example (complete) Proc object
{
  "type": "proc",
  "id": "proc_1554e77a1ee",
  "created": "2019-12-06T08:49:42Z",
  "source_tags": ["acquired:2019-12-06", "january-resumes"],
  "operations": [{ "op": "metadata" }, { "op": "images" }],
  "status": "complete",
  "docids": ["doc_8e96ec0033ac3e1e988b7d1ca27bfdc096b82ddc"],
  "documents": [{
    "type": "doc",
    "id": "doc_8e9600cd7db5baf2fad83e4d8b48359678b24322",
    "filename": "8e9600cd7db5baf2fad83e4d8b48359678b24322.pdf",
    "tags": ["acquired:2016-07-01"],
    "created": "2016-07-01T18:46:21Z",
    "expires": "2016-08-01T18:46:21Z",
    "results": [{
      "op": "metadata",
      "data": {
        "Title": "C:\\user\\workspace\\test.txt",
        "ModDate": "2009-11-11T07:27:58Z",
        "Producer": "Acrobat Web Capture 9.0",
        "CreationDate": "2009-11-11T07:12:00Z"
      }
    }, {
      "op": "images",
      "data": [{
        "type": "page",
        "images": [{
          "type": "img",
          "bounds": [62.362, 248.541, 72.362, 255.541],
          "resource": "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a"
        }, {
          "type": "img",
          "bounds": [62.362, 173.397, 63.362, 174.397],
          "resource": "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72"
        }],
        "dimensions": [595, 839],
        "pagenum": 0
      }],
      "resources": {
        "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a": {
          "id": "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a",
          "format": "png",
          "mimetype": "image/png",
          "dimensions": [10.0, 7.0]
        },
        "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72": {
          "id": "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72",
          "format": "png",
          "mimetype": "image/png",
          "dimensions": [1.0, 1.0]
        }
      }
    }]
  }]
}
Procs are long-running, inherently asynchronous things. When you create a proc through the PDFDATA.io API, our services will get to work on it right away. But, it may take some time to complete such that you can receive the resulting extracted data: sometimes seconds, but often many minutes depending on the "size" of the proc (how many source documents and how many data extraction operations are included).
As described below, the proc API will return data extraction results as part of the initial response to a request to create a new proc when possible. This is nice when it happens, but the potentially long-running nature of procs means that much of the time (and probably always in a production context), you should expect to make at least two API calls for each proc: one to configure and create a proc, and one (or more) to check the status of the proc and retrieve its extracted data.
Proc object attributes | ||
---|---|---|
type | The type of this object, always | |
id | Each proc is assigned a unique identifier when it is created. | |
created | The time when this proc was created. | |
source_tags | A collection of string tags associated with this document. These tags can be added when documents are provided to PDFDATA.io, and used later to start procs over groups of documents that share a given tag. If this proc's set of source documents were specified by tags, they will be recorded here. | |
operations | A collection of specifications of data extraction operations that the proc will or has applied to its source documents. | |
status | Either | |
docids optional | Present only when a proc's | |
documents optional | When a proc's |
All proc operations are mediated through the /v1/procs
resource and its
descendants.
All proc operations are mediated through the procs
object provided by the
pdfdata
module.
All proc operations are mediated through the ProcsRequest
object
provided by io.pdfdata.API
instances.
Creating procs
Definition
POST https://api.pdfdata.io/v1/procs
pdfdata.procs.configure({operations: {ARRAY_OF_OPERATION_SPECS},
[file: {ARRAY_OF_PDF_PATHS},]
[tag: {ARRAY_OF_TAGS},]
[docid: {ARRAY_OF_DOCUMENT_IDS},]
[wait: {SECONDS}]})
.start()
pdfdata.procs.configure()
.operations({ARRAY_OF_OPERATION_SPECS})
.withFiles({ARRAY_OF_PDF_PATHS})
.withTags({ARRAY_OF_TAGS})
.withDocuments({ARRAY_OF_DOCUMENT_IDS})
.start()
pdfdata.procs().configure()
// multiple .withXXX configuration calls
.start();
Example request: creating a proc over new source documents being uploaded
curl https://api.pdfdata.io/v1/procs \
-u test_xkH4xlrO7K80J5CTwEjBeSo6: \
-F file=@{PATH_TO_PDF} \
-F file=@{PATH_TO_PDF2} \
-F operations='[{"op":"metadata"}]'
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.procs.configure()
.withFiles(["{PATH_TO_PDF}", "{PATH_TO_PDF2}"])
.operations([{op:"metadata"}])
.start()
.then(function (response) {
// handle response
})
.catch(function (error) {
// handle failure
});
API pdfdata = new io.pdfdata.API("test_xkH4xlrO7K80J5CTwEjBeSo6");
Proc proc = pdfdata.procs().configure()
.withFiles("{PATH_TO_PDF}", "{PATH_TO_PDF2}")
.withOperations(new Metadata())
.start();
Example request: creating a proc over documents identified by ID
curl https://api.pdfdata.io/v1/procs \
-u test_xkH4xlrO7K80J5CTwEjBeSo6: \
-d docid={DOCUMENT_ID} \
-d docid={DOCUMENT_ID2} \
-d operations='[{"op":"metadata"}]'
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.procs.configure()
.withDocuments(["{DOCUMENT_ID}", "{DOCUMENT_ID2}"])
.operations([{op:"metadata"}])
.start()
.then(function (response) {
// handle response
})
.catch(function (error) {
// handle failure
});
API pdfdata = new io.pdfdata.API("test_xkH4xlrO7K80J5CTwEjBeSo6");
Proc proc = pdfdata.procs().configure()
.withDocumentIDs("{DOCUMENT_ID}", "{DOCUMENT_ID2}")
.withOperations(new Metadata())
.start();
Example request: creating a proc over documents selected via tags
curl https://api.pdfdata.io/v1/procs \
-u test_xkH4xlrO7K80J5CTwEjBeSo6: \
-d tag={TAG} \
-d tag={TAG2} \
-d operations='[{"op":"metadata"}]'
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.procs.configure()
.withTags(["{TAG}", "{TAG2}"])
.operations([{op:"metadata"}])
.start()
.then(function (response) {
// handle response
})
.catch(function (error) {
// handle failure
});
API pdfdata = new io.pdfdata.API("test_xkH4xlrO7K80J5CTwEjBeSo6");
Proc proc = pdfdata.procs().configure()
.withTags("{TAG}", "{TAG2}")
.withOperations(new Metadata())
.start();
Procs are created by sending a POST
request that:
Procs are created by configuring a proc
object and then .start()
ing it. This
configuration:
Procs are created by configuring a proc
object via the ProcCreationBuilder
available from pdfdata.procs().configure()
, and then .start()
ing it. This
configuration:
- Declares which source documents should be included in the proc. This can be
done by one of:
- enumerating the IDs of already-uploaded documents
- providing tags that should be used to select many already-uploaded documents that were previously assigned those tags
- uploading source PDF documents directly as part of the proc-creation request
- Provides specifications of which data extraction operations (plus optional configuration) should be applied to those documents.
The configuration can be provided via an object literal to the .configure()
method, or built up by using the fluent "builder"-style API
(e.g. .withFiles(["path/to/document.pdf"])
instead of .configure({file:
["path/to/document.pdf"]})
), or a combination of both.
Proc configuration options / request parameters (one and only one of document IDs, files, or tags is required/allowed) | ||
---|---|---|
docid / Document IDs optional | HTTP API: when provided, must be an ID of an unexpired source document previously provided to PDFDATA.io; multiple document IDs can be specified via multiple docid request parameters. Client libraries: when provided, must be an array or Collection of IDs of unexpired source documents previously provided to PDFDATA.io; the .withDocuments() method can be used to add document IDs to a proc configuration. If any provided document ID is unknown, or refers to an expired document, then the entire request will fail, and no proc will be created. |
|
file / Files optional | HTTP API: when provided, must be the binary content of a source PDF document; multiple files can be uploaded via multiple file request parameters. When this parameter is used, the entire request must be encoded as multipart/form-data. Client libraries: when provided, must be an array or Collection of string paths or java.io.File objects naming PDF files; the .withFiles() method can be used to add PDF files to be uploaded to a proc configuration. The handling of file parameters and semantics around document objects when creating a proc are identical to those when uploading documents separately. |
|
tag / Tags optional | HTTP API: when provided, must be a tag associated with unexpired source documents previously provided to PDFDATA.io; multiple tags can be specified via multiple tag request parameters. Client libraries: when provided, must be an array or Collection of string tags associated with unexpired source documents previously provided to PDFDATA.io; the .withTags() method can be used to add source document tags to a proc configuration. The sets of documents selected by each tag are merged to form the new proc's working set (i.e. tags select documents disjunctively). If none of the provided tags are associated with unexpired source documents, then the request will fail, and no proc will be created. |
|
operations | A JSON-encoded array, or a Collection of io.pdfdata.model.Operation objects, describing data extraction operations. Each of the operations will be applied to each of the proc's source documents. The .operations() method can be used to add operations to a proc configuration. |
|
wait optional | By default, PDFDATA.io will wait 30 seconds for a proc to complete before issuing a response to the proc-creation request. For very small workloads and simple data extraction operations, this means that many procs can be created and results returned with a single API call. This option allows you to specify a shorter period, in seconds; wait values larger than 30 will be ignored. Typical usage would be to provide a wait value of 0, which will cause the API to issue a pending proc response immediately; a later request will then be necessary to check on the proc's status and potentially retrieve results. See "Procs are processes" for further background / overview of the asynchronous nature of procs. |
Procs can be created with document IDs, or document tags, or by uploading new documents, but these options are exclusive (i.e. you cannot upload new documents, and specify tags to select additional previously-uploaded documents to be included in the proc).
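The exclusivity rule above can be checked on the client before a request is sent. The following is a minimal sketch in JavaScript, assuming a plain configuration object that uses the docid, file, and tag keys shown in the definition; checkProcSources is a hypothetical helper and is not part of pdfdata-node.

```javascript
// Hypothetical helper (not part of pdfdata-node): verifies that a proc
// configuration names exactly one kind of document source.
function checkProcSources(config) {
  // Collect the source kinds that are present and non-empty.
  const kinds = ["docid", "file", "tag"].filter(function (key) {
    return Array.isArray(config[key]) && config[key].length > 0;
  });
  if (kinds.length !== 1) {
    throw new Error(
      "exactly one of docid, file, or tag is required; got: " +
        (kinds.join(", ") || "none")
    );
  }
  return kinds[0]; // the single source kind in use
}
```

Calling it with both tag and docid arrays, or with neither, throws before any network request is made; the API itself remains the authority on what it accepts.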
When a document is included in a new proc, its expires
attribute is "reset" to
30 days from the time of the proc's creation, mirroring the expiration extension
that occurs when a document is re-uploaded to PDFDATA.io.
PDFDATA.io provides many, many data extraction operations. The particulars of what configuration options are supported by each operation and the shape of the structured data and flavour of binary resources they extract are all described here.
The proc creation response
Successfully creating a proc will produce an API response containing the new proc object. That response will:
- include a Location header, the value of which will be the canonical URL for the new proc. That URL can be retrieved later in order to check the proc's status, or retrieve its data extraction results.
- bear an HTTP 201 (created) status if the proc completes before the response is sent. Otherwise, a status of 202 (accepted) will be sent back, indicating the proc's pending status at the time of the response.
The body of the proc-creation response will be the JSON-encoded proc object itself, equivalent to separately requesting the proc object later via its canonical URL, described next.
Successfully creating and starting a proc will produce the created proc object as a response, equivalent to separately requesting the proc object later via its canonical ID, described next.
Successfully creating and starting a proc will return a io.pdfdata.model.Proc
object as a response, equivalent to separately requesting the proc later via its
canonical ID, described next.
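When working against the HTTP API directly, the status code and Location header described above are enough to decide whether a follow-up request is needed. This is a small sketch under stated assumptions: readCreationResponse is a hypothetical helper, and headers is assumed to be a plain object with lower-cased header names.

```javascript
// Hypothetical helper: interprets the status code and headers of a
// proc-creation response. `headers` is assumed to be a plain object
// keyed by lower-cased header names.
function readCreationResponse(statusCode, headers) {
  if (statusCode !== 201 && statusCode !== 202) {
    throw new Error("proc was not created (HTTP " + statusCode + ")");
  }
  return {
    complete: statusCode === 201, // 201: the proc finished before the response
    procUrl: headers["location"]  // canonical URL to GET later
  };
}
```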
Getting the results of a proc
Definition
GET https://api.pdfdata.io/v1/procs/{PROC_ID}
pdfdata.procs.get({PROC_ID});
pdfdata.procs().byID(String procID)
Example request
curl https://api.pdfdata.io/v1/procs/proc_15555c7e6c2 \
-u test_xkH4xlrO7K80J5CTwEjBeSo6:
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.procs.get("proc_15555c7e6c2")
.then(function (response) {
// handle response
})
.catch(function (error) {
// handle failure
});
API pdfdata = new io.pdfdata.API("test_xkH4xlrO7K80J5CTwEjBeSo6");
Proc proc = pdfdata.procs().byID("proc_15555c7e6c2");
Example (pending) response
io.pdfdata.model.Proc / JSON
{
"type": "proc",
"id": "proc_1555580e8ff",
"created": "2016-06-15T19:19:19Z",
"source_tags": [],
"operations": [{
"op": "metadata"
}],
"status": "pending",
"docids": [
"doc_8e9600cd7db5baf2fad83e4d8b48359678b24322",
"doc_8e96ec0533ac3e1e988b7d1ca27bfdc096b82ddc",
"doc_a5d8e5d0b99ac891226acb35f24a9f8f8eda50df"
]
}
Example (completed) response
io.pdfdata.model.Proc / JSON
{
"type": "proc",
"id": "proc_1555579d8df",
"created": "2016-06-15T19:11:36Z",
"source_tags": [],
"operations": [{
"op": "metadata"
}],
"status": "complete",
"documents": [{
"type": "doc",
"id": "doc_8e96ec0533ac3e1e988b7d1ca27bfdc096b82ddc",
"filename": "8e96ec0533ac3e1e988b7d1ca27bfdc096b82ddc.pdf",
"tags": ["acquired:2016-06-15"],
"created": "2016-06-15T19:11:36Z",
"expires": "2016-07-15T19:15:32Z",
"results": [{
"op": "metadata",
"data": {
"Creator": "SPDF",
"Title": "Microarray Gene Expression Data with Linked Survival Phenotypes",
"Producer": "AppendPro 3.0 Linux 7 SPDF_1085 May 15 2003",
"Subject": "Center for Bioinformatics & Molecular Biostatistics",
"ModDate": "2006-01-24T20:38:13Z",
"Keywords": "Diffuse large-B-cell lymphoma; Gene harvesting; Least angle regression",
"CreationDate": "2006-01-24T20:38:13Z",
"Author": "Mark R. Segal",
"SPDF": 1085,
"Changes":
[{"CreationDate": "2006-01-24T20:38:13Z",
"Producer": "SPDF",
"ModDate": "2006-01-24T20:38:13Z",
"Creator": "SPDF"}]
}
}]
}]
}
Proc objects always exist in one of two primary states:
- with a status of "pending", and the collection of string document IDs included in the proc available under the docids attribute (via the .getDocIDs() method in the Java client)
- with a status of "complete", and a collection of document objects included in the proc under the documents attribute (via the .getDocuments() method in the Java client). When obtained from a completed proc, each document object is extended with an additional results attribute, an array (in the Java client, each document is an instance of the io.pdfdata.model.ProcessedDocument subclass, which provides an additional .getResults() method) providing access to a collection of extracted data corresponding to (and in the same order as) the operations specified when the proc was created.
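The two states can be handled with a single check on status. The sketch below works on a parsed proc object shaped like the example responses; resultsByDocument is a hypothetical helper name, and the pairing of each result with its requested operation relies only on the ordering guarantee stated above.

```javascript
// Hypothetical helper: returns null for a pending proc, or an object
// mapping each document ID to its results, each paired with the name of
// the operation that was requested at the same position.
function resultsByDocument(proc) {
  if (proc.status !== "complete") {
    return null; // pending: only proc.docids is available
  }
  const byDoc = {};
  proc.documents.forEach(function (doc) {
    byDoc[doc.id] = doc.results.map(function (result, i) {
      return { requested: proc.operations[i].op, result: result };
    });
  });
  return byDoc;
}
```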
If a new proc is completed before the proc-creation response is issued (the
upper bound of which is controlled by the wait
parameter), then the
response to that request will include the resulting extracted data. Otherwise,
you will need to make one or more additional API calls to check the status of
the proc and receive its results when it is complete.
This is done by issuing a GET
request for the proc's canonical URL, which is
included in the Location
header of every proc-creation response. (You can also
reliably construct this URL for any known proc ID.) The response will always be
the single identified proc object.
This is done by calling the pdfdata.procs.get()
method with a proc ID string
(or a pending proc object), or by calling pdfdata.procs.getCompleted()
to
obtain a promise that will automatically be fulfilled when the proc is complete,
described next.
This is done by calling the pdfdata.procs().byID(String)
method with a proc ID string.
Waiting for proc completion
Waiting 5 minutes for a proc to complete
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.procs.getCompleted("proc_15555c7e6c2", 5 * 60 * 1000)
.then(function (proc) {
if (proc.status == "complete") {
// handle completed proc
} else {
// proc is still "pending"
}
})
.catch(function (error) {
// handle failure
});
"Manually" polling pdfdata.procs.get()
periodically to check whether a proc
has completed its work is not a pleasant programming task. Therefore,
pdfdata-node
provides a simpler way to be notified when a proc is completed.
The example to the right demonstrates the use of pdfdata.procs.getCompleted()
,
which returns a promise that is fulfilled when the identified proc is completed,
or when the specified timeout expires.
A call to .getCompleted()
can be easily applied to the result of creating a
proc, wrapping up that creation and then any waiting for the proc to finish
into a single block of promise calls and handling.
pdfdata.procs.getCompleted() arguments | ||
---|---|---|
procid | The string ID of the proc to retrieve (or, a full proc object, e.g. a pending proc produced by a proc creation call) | |
timeout_ms | How long to wait (in milliseconds) for the identified proc to complete. If this period expires, the returned promise will be fulfilled with a still-pending proc object. | |
polling_interval optional | How often to poll the PDFDATA.io API to check on the status of the proc. 1000 is both the default, and the minimum value. |
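For clients without a getCompleted() equivalent, the same behaviour can be approximated with a plain polling loop. This is only a sketch of the idea, not the pdfdata-node implementation: waitForProc and pollingInterval are hypothetical names, getProc is assumed to be any function that returns a promise of a proc object for an ID, and the optional sleep argument exists only so the delay can be replaced in tests.

```javascript
// Minimum (and default) polling interval, per the table above.
const MIN_INTERVAL_MS = 1000;

// Clamp a requested interval to the documented minimum.
function pollingInterval(requestedMs) {
  return Math.max(
    MIN_INTERVAL_MS,
    requestedMs === undefined ? MIN_INTERVAL_MS : requestedMs
  );
}

// Polls getProc(procId) until the proc is complete or the timeout would
// be exceeded; resolves with the last proc object seen (possibly pending).
async function waitForProc(getProc, procId, timeoutMs, intervalMs, sleep) {
  const pause = sleep || function (ms) {
    return new Promise(function (resolve) { setTimeout(resolve, ms); });
  };
  const interval = pollingInterval(intervalMs);
  const deadline = Date.now() + timeoutMs;
  let proc = await getProc(procId);
  while (proc.status !== "complete" && Date.now() + interval <= deadline) {
    await pause(interval);
    proc = await getProc(procId);
  }
  return proc;
}
```

As with getCompleted(), the caller still has to check the status of the returned proc, because the timeout can expire while the proc is pending.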
The structure of data extraction results
PDF documents contain a wide range of data types: key/value metadata, on-page character data, bitmap images, form data, vector graphics, and more. PDFDATA.io offers an even wider range of operations that extract this data, some of which reorganize and integrate those fundamental data types in different ways (for example, to selectively extract text based on your own criteria, infer the location and structure of tabular data, rasterize vector graphics, and so on). The structure of a proc's data extraction results will depend entirely on the operations you select and configure when creating it.
The results
attribute added to each document object in a completed proc
response is always a collection of operation result objects (all subclasses of
io.pdfdata.model.Operation.Result
), in the same order they were
specified when the proc was created. Operations can produce two kinds of
results; depending on what an operation does, it may produce either or both of:
- data, the structure of which will vary from operation to operation
- resources, data extraction results that cannot be readily represented using JSON (e.g. binary assets like extracted bitmap images)
Data may refer to resources within the same operation's results, if the
operation produces both. For example, a sub-attribute of the data
value may
describe the position and location of an image, referring to the image's
corresponding resource, which can be retrieved separately.
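To make the relationship between data and resources concrete, the sketch below joins each image reference in an images result with the resource record it names, following the shape of the complete proc example earlier in this section. describeImages is a hypothetical helper, and retrieving the binary content of each resource is out of scope here.

```javascript
// Hypothetical helper: for an "images" operation result, lists every
// image with its page, bounds, and the MIME type of its resource.
function describeImages(imagesResult) {
  const described = [];
  imagesResult.data.forEach(function (page) {
    page.images.forEach(function (img) {
      // img.resource is an ID that keys into the result's resources map.
      const resource = imagesResult.resources[img.resource];
      described.push({
        pagenum: page.pagenum,
        bounds: img.bounds,
        resourceId: img.resource,
        mimetype: resource ? resource.mimetype : undefined
      });
    });
  });
  return described;
}
```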
Please refer to the documentation for each operation for particulars on what data and/or resources they produce.
Proc and operation failures
Example of failed operation result object
io.pdfdata.model.Proc / JSON
{
"type": "proc",
"id": "proc_15651e77af7",
"created": "2016-08-03T19:35:40Z",
"source_tags": [],
"operations": [{
"op": "metadata"
}],
"documents": [{
"type": "doc",
"id": "doc_9662c6d4f7a7eedeb1304688c5767cfa84db067a",
"filename": "report.png",
"tags": ["acquired:2016-08-03"],
"created": "2016-08-03T19:32:44Z",
"expires": "2016-09-02T19:35:40Z",
"results": [{
"op": "metadata",
"failure": true
}]
}, {
"type": "doc",
"id": "doc_144a4c8c770dba924e924c8ee26099d585c83986",
"filename": "144a4c8c770dba924e924c8ee26099d585c83986.pdf",
"tags": ["acquired:2016-08-03"],
"created": "2016-08-03T19:35:32Z",
"expires": "2016-09-02T19:35:40Z",
"results": [{
"op": "metadata",
"data": {
"Title": "Content Categorization Methodologies: The Good, The Bad, The Best, and Why",
"Author": "Claude Vogel",
"Creator": "Microsoft Word 9.0",
"ModDate": "2002-11-08T14:27:29Z",
"Producer": "Acrobat Distiller 4.05 for Windows",
"CreationDate": "2002-10-15T12:49:49Z"
}
}]
}],
"status": "complete"
}
As discussed elsewhere, the PDFDATA.io API may produce errors under
various circumstances that are very clear, e.g. returning an HTTP response with
a 4xx status code if a request is malformed in some way (which will cause an
io.pdfdata.APIException to be thrown in the Java client). However, because
procs' work is done asynchronously, and because each proc may coordinate the
application of more than one operation to more than one source document,
standard HTTP response statuses and imperative client-side error handling aren't
sufficient to capture the general case of e.g. a single operation failing to
extract the requested data from a single source document, but succeeding with
hundreds or thousands of other source documents, all part of the same proc.
In such cases where an operation fails to process a source document, the result
object it produces for that one document will not contain any extracted
data. Rather, it will contain only the name of the failing operation, and a
failure attribute (in the Java client, the result object's .isFailure() method
will return true).
In the example to the right, a proc was created applying the metadata
operation to two sources; one succeeded, and provided the expected data, but
because the other source was not actually a PDF document (it was an image),
the result object produced by the metadata
operation indicated the processing
failure.
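Because failures are reported per document and per operation, a completed proc has to be scanned for them explicitly. A minimal sketch, assuming a parsed proc object shaped like the example; failedOperations is a hypothetical helper name.

```javascript
// Hypothetical helper: lists every failed operation result in a proc,
// identified by document ID, filename, and operation name.
function failedOperations(proc) {
  const failures = [];
  (proc.documents || []).forEach(function (doc) {
    doc.results.forEach(function (result) {
      if (result.failure === true) {
        failures.push({ docid: doc.id, filename: doc.filename, op: result.op });
      }
    });
  });
  return failures;
}
```

Applied to the example proc, this would report a single metadata failure for report.png and nothing for the second document.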
Operations
PDFDATA.io offers a number of different extraction operations. Some of them provide structured data, where the data elements that are extracted can be reliably known ahead of time (such as form data, or elements of labeled page templates); other extraction operations yield unstructured content (such as the text of entire pages or documents, or the images or attachments embedded in source PDFs).
Your application requirements will dictate what you need from your PDFs; you just need to choose the right PDFDATA.io operation to extract that content or data.
Each section that follows will detail each operation, providing:
- An overview of the operation and the PDF data it extracts.
- A "specification" and example of the configuration object that is provided when creating procs.
- Where possible, a specification of the shape of the data extraction results the operation produces, and always an example or two of those results.
Page Templates
Many types of documents consist of highly-structured data presented in a rigid,
unchanging visual layout. In these cases, it's easy to recover that data and
retain that structure by defining where on each page of a document the data
elements of interest will be found, and naming them as you'd like to appear in
the extracted data. This is what PDFDATA.io's page-templates
operation does.
page-templates
needs to know where on a page to find the data you're
interested in, and what that data should be named in the extraction result.
Example operation configuration
{ "op": "page-templates", "templates": { "2016 Form W-2": { "regions": { "employee-ssn": { "bounds": [152, 732, 278, 748] }, "gross-wages": { "bounds": [331, 708, 452, 724] }, "W-2": { "bounds": [56, 428, 103, 453], "contains": "W-2" }, "year": { "bounds": [263, 422, 335, 454], "contains": "2016" } } } } }
new io.pdfdata.model.ops.PageTemplates()
.withTemplate("2016 Form W-2", new PageTemplates.Template()
.withRegion("employee-ssn", new PageTemplates.Region(152, 732, 278, 748))
.withRegion("gross-wages", new PageTemplates.Region(331, 708, 452, 724))
.withRegion("W-2", new PageTemplates.Region(56, 428, 103, 453)
.containingString("W-2"))
.withRegion("year", new PageTemplates.Region(263, 422, 335, 454)
.matchingRegex("\\d{4}")))
page-templates operation configuration object attributes | ||
---|---|---|
op | Must be | |
templates | An object with uniquely-named |
When you start a proc including a page-templates
operation, PDFDATA.io will
attempt to match each page in each source document to each of the Template
s
you define in the operation configuration. A template has two possible
constraints:
- An optional pagenum; if defined, a template match will only be attempted against the indicated page from each source document in the proc.
- The regions it defines, bounding boxes which delineate and name areas on a page from which data should be pulled, and which may carry certain constraints on that data. If any region's constraints are not satisfied, then that region's template will not match the page in question, and will not be included in the operation's output.
Template object attributes | ||
---|---|---|
regions | An object with uniquely-named | |
pagenum optional | The source document's page number, zero-indexed. If a template has no assigned page number, then the proc will attempt to match it against every page in a source document; thus, multiple template matches may be returned, one for each matching page, insofar as the template's region specifications are satisfied (with regard to validation checks and so on). If a template does have an assigned page number, then the proc will only ever attempt to match it against that page number in source documents, and at most one match will be returned. |
A template may define any number of regions, each of which may require that the
data found within its bounds
match or contain a regular expression or string
literal.
Note that, depending on a template's regions' constraints (or lack thereof), a template may match multiple pages from a single source document. In addition, if multiple templates are provided in the operation configuration, then multiple templates may match any given page from a source document. This may or may not be desirable, depending upon the nature of the source documents and your data extraction requirements.
Region Query object attributes | ||
---|---|---|
bounds | A description of a rectangular bounding box's coordinates. | |
match optional | A regular expression that must match some range of the text extracted from the region; otherwise, the region's template will not match that page. | |
contains optional | A string that must be found within the text extracted from the region; otherwise, the region's template will not match that page. |
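The constraint semantics in this table can be illustrated with a small client-side sketch. It is not how PDFDATA.io performs matching; it only mirrors the documented rules so they can be reasoned about or tested locally. regionSatisfied and templateMatchesPage are hypothetical names, textByRegion is an assumed map from region name to the text found within its bounds, and the regular expression dialect is assumed to be compatible with JavaScript's RegExp.

```javascript
// Illustration only: a region is satisfied when its optional "contains"
// string is found in the text and its optional "match" regular
// expression matches some range of the text.
function regionSatisfied(region, text) {
  if (region.contains !== undefined && !text.includes(region.contains)) {
    return false;
  }
  if (region.match !== undefined && !new RegExp(region.match).test(text)) {
    return false;
  }
  return true; // no constraints, or all constraints satisfied
}

// A template matches a page only if every one of its regions is satisfied.
function templateMatchesPage(template, textByRegion) {
  return Object.keys(template.regions).every(function (name) {
    return regionSatisfied(template.regions[name], textByRegion[name] || "");
  });
}
```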
In the example page-templates
operation configuration shown to the right,
contains
constraints are used to ensure that the W-2 template only matches
pages that are actually W-2 forms of the particular year corresponding to the
size and position of the regions' bounds
. Other pages within each source
document that don't match these constraints (perhaps cover pages, different
editions of the W-2 form, or entirely different content altogether) will not
match that template; however, if multiple 2016 W-2 forms are collected into a
single source document, then this page-templates
configuration will find all
of them, and yield each match's data in the proc output.
Example
page-templates
result object
io.pdfdata.model.ops.PageTemplates.Result / JSON
{
"op": "page-templates",
"data": [{
"pagenum": 0,
"regions": {
"W-2": "W-2",
"year": "2016",
"gross-wages": "91827.12",
"employee-ssn": "987-65-4321"
},
"template": "2016 Form W-2"
}, {
"pagenum": 1,
"regions": {
"W-2": "W-2",
"year": "2016",
"gross-wages": "62307.90",
"employee-ssn": "123-45-6789"
},
"template": "2016 Form W-2"
}]
}
The result of a page-templates
operation is a sequence of Template Match
objects. The match objects' structure corresponds to the structure of the
templates provided in the operation configuration, containing the values for
each named region on each matched page.
Template Match object attributes | ||
---|---|---|
pagenum | The source document's page number, zero-indexed. | |
template | A string identifying the template that produced this match. | |
regions | An object with attribute names corresponding to template region names. Values will be the data extracted from those regions, either a string, or a |
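Template Match objects are easy to flatten into one record per matched page. The following sketch assumes a parsed page-templates result shaped like the example; rowsForTemplate is a hypothetical helper, and a region that happens to be named pagenum would overwrite the page number in the flattened record.

```javascript
// Hypothetical helper: returns one flat record per match of the named
// template, combining the page number with the extracted region values.
function rowsForTemplate(result, templateName) {
  return result.data
    .filter(function (match) { return match.template === templateName; })
    .map(function (match) {
      return Object.assign({ pagenum: match.pagenum }, match.regions);
    });
}
```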
Interactive form data
Filling an interactive PDF form in Acrobat
The PDF specification allows a document to contain any number of form fields,
similar to the text, radio button, checkbox, and select form fields found in
HTML, so that the document can be interactively updated by a user, saved, and
then submitted in order to gather data. PDFDATA.io's interactive-form
operation extracts this data.
Example operation configuration
{ "op": "interactive-form" }
interactive-form operation configuration object attributes | ||
---|---|---|
op | Must be |
The interactive-form
operation will always extract the data for all of the
fields in a source document. (A single PDF document only contains one "form", so
we're using the terms "document" and "form" interchangeably in this section of
the API documentation.) Interactive PDF forms can contain any number of the
following sorts of fields:
Types of interactive PDF form fields | ||
---|---|---|
"text" |
A simple editable text field. | |
"checkbox" |
A button with an associated on/off state. | |
"radiogroup" |
A set of buttons where at most one can have an "on" state. | |
"choice" |
A set of options — rendered as a drop-down menu or scrolling list — where one or more may be selected. |
Example
interactive-form
result object
{
"op": "interactive-form",
"data": [
{
"name": "f1-1",
"value": "John Doe",
"bounds": [86.6656, 699.602, 590.658, 714.268],
"pagenum": 0,
"fieldtype": "text"
},
{
"name": "f1-5",
"value": "New York, NY 10001",
"bounds": [86.33229, 603.26984, 410.66068, 617.9363],
"pagenum": 0,
"fieldtype": "text"
},
{
"name": "f1-4",
"value": "123 Canal St.",
"bounds": [86.66562, 627.2695, 409.994, 641.9359],
"pagenum": 0,
"fieldtype": "text"
},
....
{
"name": "c1-2",
"value": "/Yes",
"bounds": [251.663, 659.602, 259.996, 667.269],
"checked": false,
"pagenum": 0,
"fieldtype": "checkbox"
},
{
"name": "c1-1",
"value": "/Yes",
"bounds": [171.664, 658.602, 181.331, 668.269],
"checked": true,
"pagenum": 0,
"fieldtype": "checkbox"
}
]
}
Each form field found in a document is represented as a field object in the
data
array of the interactive-form
operation's result object. If a source
document does not contain any fields, then the result object's data
attribute
will be an empty array.
All form fields, regardless of type, may carry a number of common attributes:
Common form field object attributes | ||
---|---|---|
fieldtype | The type of this field, one of | |
name | The form field's name, unique within its source document/form. This name is set by the generator of the source PDF document, and generally will have no identifiable meaning. | |
mapping_name optional | The form field's "mapping" name, used to identify the field in exported form data formats. If this was set, it might correspond with a schema element name, database column, etc. | |
ui_name optional | The form field's human-friendly name that might be used for display to users, accessibility tools, and other interface contexts. | |
default optional | The field's default value, a string. | |
pagenum | The source document's page number, zero-indexed. | |
bounds | A description of a rectangular bounding box's coordinates. |
In addition to these base attributes, some additional attributes will be present depending on the type of field:
Additional text form field object attributes | ||
---|---|---|
value | A string, the "plain text" value of this text field, may be | |
xhtml_value optional | The value of this text field, represented as a string containing XHTML content. Note: it is rare for interactive PDF forms to offer these sorts of "rich-text" fields. |
Additional checkbox form field object attributes | ||
---|---|---|
checked |
| |
value | The "current value" of this checkbox field. Checkboxes in interactive PDF forms (should) always have an associated value, even when they are unchecked. When unchecked, this value should be We generally recommend that your applications identify the semantics of each checkbox form field independently and rely upon the extracted form field's |
Additional radiogroup form field object attributes | ||
---|---|---|
options | An array of strings enumerating the possible | |
value | This radiogroup's value, a string selected from the |
Additional choice form field object attributes | ||
---|---|---|
values | An array of strings, the options selected in this | |
options | An object enumerating the possible |
Consuming PDF form data
There are two main strategies for consuming PDF form data:
- If you know what the form fields' names are ahead of time (maybe because you have control over the form's generation and can control them, or perhaps because you have already determined the correspondence between each field name and your application's data model), then you can simply loop through each form field, consume its state (value, values, and checked attributes, depending on the field type), and apply it along with the field's name (or mapping_name, as appropriate) to your data model.
- If the forms you are consuming have meaningless name attributes (e.g. "c1-1"), then you will have to identify each form field with regard to your application's data model. If the forms you're consuming use a stable set of field names, then you can then proceed according to strategy (1). If field names change from form to form (a possibility if e.g. you are consuming IRS 1040 forms generated by many different PDF producers), then you would be better off to use the form field location information that the interactive-form operation provides (pagenum and bounds) to usefully identify each form field extracted from each document.
Document metadata
Most PDF documents contain a set of simple key/value metadata that often
includes baseline information like a document's title, author, creation date,
keywords, and what program or system generated the document. The metadata
operation extracts this data.
Example operation configuration
{ "op": "metadata" }
new io.pdfdata.model.ops.Metadata()
Example metadata result object
io.pdfdata.model.ops.Metadata.Result JSON {
"op": "metadata",
"data": {
"Creator": "SPDF",
"Title": "Microarray Gene Expression Data with Linked Survival Phenotypes",
"Producer": "AppendPro 3.0 Linux 7 SPDF_1085 May 15 2003",
"Subject": "Center for Bioinformatics & Molecular Biostatistics",
"ModDate": "2006-01-24T20:38:13Z",
"Keywords": "Diffuse large-B-cell lymphoma; Gene harvesting; Least angle regression",
"CreationDate": "2006-01-24T20:38:13Z",
"Author": "Mark R. Segal",
"SPDF": 1085,
"Changes":
[{"CreationDate": "2006-01-24T20:38:13Z",
"Producer": "SPDF",
"ModDate": "2006-01-24T20:38:13Z",
"Creator": "SPDF"}]
}
}
metadata operation configuration object attributes | ||
---|---|---|
op | Must be "metadata". |
PDF document generators can opt to include any other metadata they choose in
PDFs they produce, beyond the common attributes enumerated below. In the Java
client, accessors for all of these common metadata attributes are available on
io.pdfdata.model.ops.Metadata.Result objects.
Though there are no universal conventions for the keys used to convey such
additional metadata (whereas there is a universal convention defining the keys
for the common baseline metadata), there are domain-specific metadata key
conventions (e.g. within prepress, legal publishing, and other fields that often
enrich document metadata beyond the baseline keyset). In the Java client, you
can access any such non-standard metadata attributes by querying the full set
of document metadata extracted from the source PDF, provided as a Jackson
JsonNode by the io.pdfdata.model.ops.Metadata.Result.getData() method.
Common document metadata attributes | ||
---|---|---|
Title | The document's title. | |
Author | The name of the person who created the document. | |
Subject | The document's subject. | |
Keywords | Keywords associated with the document, usually comma- or semicolon-delimited. | |
Creator | If the document was converted to PDF from another format, the name of the application that created the original document from which it was converted. | |
Producer | If the document was converted to PDF from another format, the name of the application that converted it to PDF. | |
CreationDate | A timestamp indicating when the document was created. | |
ModDate | A timestamp indicating when the document was most recently modified. |
In addition to simple atomic values, document metadata attributes can contain
nested collections (arrays and/or objects), such as the Changes
attribute in
the example metadata result to the right. Such attributes are not common
— they are keyed outside of the baseline keyset documented above — but
applications should be written so as to accommodate their possibility.
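One way to accommodate both the baseline keys and unexpected nested values is to partition the metadata before using it. The sketch below is illustrative only: BASELINE_KEYS, partitionMetadata, and isNested are hypothetical names rather than part of any PDFDATA.io client library, and the sample object is an abridged copy of the data attribute from the example metadata result. It also assumes that timestamps arrive as ISO 8601 strings, as they do in that example.

```javascript
// Abridged copy of the `data` attribute from the example metadata result.
const metadata = {
  Title: "Microarray Gene Expression Data with Linked Survival Phenotypes",
  Author: "Mark R. Segal",
  CreationDate: "2006-01-24T20:38:13Z",
  SPDF: 1085,
  Changes: [{ CreationDate: "2006-01-24T20:38:13Z", Producer: "SPDF" }],
};

// The common metadata attributes listed in the table above.
const BASELINE_KEYS = [
  "Title", "Author", "Subject", "Keywords",
  "Creator", "Producer", "CreationDate", "ModDate",
];

// Split metadata into baseline attributes and everything else.
function partitionMetadata(data) {
  const baseline = {};
  const extra = {};
  for (const [key, value] of Object.entries(data)) {
    (BASELINE_KEYS.includes(key) ? baseline : extra)[key] = value;
  }
  return { baseline, extra };
}

// True for arrays and objects, i.e. values that are not simple atoms.
const isNested = (value) => value !== null && typeof value === "object";

const { baseline, extra } = partitionMetadata(metadata);
const created = new Date(baseline.CreationDate);
const nestedKeys = Object.keys(extra).filter((key) => isNested(extra[key]));

console.log(baseline.Title, created.toISOString(), nestedKeys);
```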
XMP (XML) document metadata
Example operation configuration
{ "op": "xmp-metadata" }
new io.pdfdata.model.ops.XMPMetadata()
Example result object
io.pdfdata.model.ops.XMPMetadata.Result JSON {
"op": "xmp-metadata",
"resources": {
"rsrc_4218185ed0b3736f47ec787bb4142189cbcb057b": {
"url": "/v1/resources/rsrc_4218185ed0b3736f47ec787bb4142189cbcb057b",
"mimetype": "application/xml"
}
}
}
Example result when source PDF contains no XMP metadata
io.pdfdata.model.ops.XMPMetadata.Result JSON {
"op": "xmp-metadata",
"resources": {}
}
Example XMP metadata resource
This is a small example of the content of an extracted XMP metadata resource. Refer to the expansive XMP metadata specifications for a proper reference.
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:ModifyDate>2009-11-11T16:27:58+09:00</xmp:ModifyDate>
<xmp:CreateDate>2009-11-11T16:12+09:00</xmp:CreateDate>
<xmp:MetadataDate>2009-11-11T16:27:58+09:00</xmp:MetadataDate>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">C:\user\workspace\test.txt</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:b5582405-9d15-42f0-9eeb-80480c76463b</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:6affcf96-d90e-4900-82f4-7c90463c50cd</xmpMM:InstanceID>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Acrobat Web Capture 9.0</pdf:Producer>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
Some PDFs contain XMP metadata, an alternative representation of document
metadata that is encoded as XML. The xmp-metadata operation extracts this XML
and provides it as a resource (in the Java client, available via the
.getResources() method on the io.pdfdata.model.ops.XMPMetadata.Result object
it produces).
If no XMP metadata is found in the source document, then the resources
attribute of the operation result will be empty. xmp-metadata results carry no
data beyond their resources.
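Because an empty resources object is the only signal that a document has no XMP metadata, calling code may want to check for it explicitly. A minimal sketch, in which the xmpResource helper is a hypothetical name and the two inputs are the example results shown in this section:

```javascript
// Return the XMP resource's ID and metadata, or null when the source
// document contained no XMP metadata (i.e. `resources` is empty).
function xmpResource(result) {
  const ids = Object.keys(result.resources || {});
  if (ids.length === 0) return null;
  return { id: ids[0], ...result.resources[ids[0]] };
}

const withXmp = {
  op: "xmp-metadata",
  resources: {
    rsrc_4218185ed0b3736f47ec787bb4142189cbcb057b: {
      url: "/v1/resources/rsrc_4218185ed0b3736f47ec787bb4142189cbcb057b",
      mimetype: "application/xml",
    },
  },
};
const withoutXmp = { op: "xmp-metadata", resources: {} };

const found = xmpResource(withXmp);
const missing = xmpResource(withoutXmp);
console.log(found, missing);
```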
xmp-metadata operation configuration object attributes | ||
---|---|---|
op | Must be "xmp-metadata". |
Text
Example text extracted using "decompose" layout
Relaciones comerciales desde :
Crédito :
Consumo mensual :
Máximo consumo :
Plazo de pago :
Pago con cheque :
Pago con transferencia :
Pago en efectivo :
Productos que compran :
Cifras expresadas en :
Opinión :
Atraso en pagos :
Requieren garantías :
19/08/2014
2008
Abierto
300,000.00
500,000.00
90 días
NO
SI
NO
Equipo hotelero y de restaurante
Pesos
Muy Bueno
No tiene
NO
Example text extracted using "preserve" layout
19/08/2014
Relaciones comerciales desde : 2008
Crédito : Abierto
Consumo mensual : 300,000.00
Máximo consumo : 500,000.00
Plazo de pago : 90 días
Pago con cheque : NO
Pago con transferencia : SI
Pago en efectivo : NO
Productos que compran : Equipo hotelero y de restaurante
Cifras expresadas en : Pesos
Opinión : Muy Bueno
Atraso en pagos : No tiene
Requieren garantías : NO
PDFDATA.io's text operation extracts the text contained in source PDF
documents. PDFDATA.io supports all languages, text encodings, character sets,
and writing modes, except for right-to-left writing systems like those
associated with Arabic, Hebrew, Urdu, and so on.
In order to accommodate different types of documents and expectations
of the sorts of content they contain, the text operation provides two
different layout options, "preserve" and "decompose".
The difference between "preserve" and "decompose" is best illustrated with
an example. Consider the following portion of a PDF document:
The text extracted by PDFDATA.io using the two different layout options is
shown to the right. The "decompose" layout mode attempts to produce a
"linearization" of all of the text on each page that matches natural human
reading order; for example, text laid out in columns will be separated so that
each column's content will follow in sequence. This makes "decompose" ideal
for use with documents that contain narrative content that you might later
subject to indexing, natural language processing analyses, semantic entity
labelling, summarization, and so on.
Example tabular text extracted using "preserve" layout
Original Beginning Interest Ending
Certificate Certificate Principal Interest Realized Loss Shortfall Total Certificate
Class Cusip Face Value Balance (1) Distribution Distribution (2) of Principal Amount Distribution Balance (1)
A-1 04541GGN6 $230,000,000.00 $0.00 $0.00 $0.00 N/A $0.00 $0.00 $0.00
A-2 04541GGP1 $268,000,000.00 ($0.00) $0.00 $0.00 N/A $0.00 $0.00 $0.00
A-3 04541GGQ9 $128,200,000.00 $0.00 $0.00 $0.00 N/A $0.00 $0.00 $0.00
A-IO 04541GGR7 $60,200,000.00 $0.00 $0.00 $0.00 N/A $0.00 $0.00 $0.00
M-1 04541GGS5 $45,000,000.00 $42,467,274.25 $0.00 $243,574.81 $0.00 $0.00 $243,574.81 $42,467,274.25
M-2 04541GGT3 $37,500,000.00 $37,500,000.00 $0.00 $228,743.68 $0.00 $0.00 $228,743.68 $37,500,000.00
M-3 04541GGU0 $11,250,000.00 $11,250,000.00 $0.00 $68,623.11 $0.00 $0.00 $68,623.11 $11,250,000.00
M-4 04541GGV8 $11,250,000.00 $9,031,924.44 $890,676.02 $55,093.22 $0.00 $0.00 $945,769.24 $8,141,248.42
M-5 04541GGW6 $9,370,000.00 $2,766,258.24 $24,577.16 $4,121.36 $0.00 $12,752.35 $28,698.52 $2,741,681.08
M-6 04541GGX4 $9,372,000.00 $2,864,767.61 $78,620.39 $0.00 $0.00 $17,474.60 $78,620.39 $2,786,147.22
P 04541GHA3 $100.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00
X 04541GGZ9 $0.00 $4,770,105.14 $0.00 $0.00 $0.00 $0.00 $0.00 $4,780,892.59
R 04541GHB1 $0.00 $0.00 $0.00 $0.00 N/A $0.00 $0.00 $0.00
B-IO 04541GGY2 $54,000,000.00 $0.00 $0.00 $0.00 N/A $0.00 $0.00 $0.00
Total $749,942,100.00 $105,880,224.54 $993,873.57 $600,156.18 $0.00 $30,226.95 $1,594,029.75 $104,886,350.97
In contrast, the (default) "preserve" layout results in extracted
text that roughly matches the spatial arrangement of that text on each
page. This makes "preserve" ideal for cases where post-processing of
PDFDATA.io-extracted text will be used to identify structured data elements,
like the label/data pairs in the example above, or regions like
this financial disclosure table:
Applying the text operation with a layout of "preserve" to this source PDF
document yields the well-formatted text shown to the right.
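As a minimal sketch of that kind of post-processing, the following splits "preserve" text into label/value pairs. The parseLabelValuePairs helper and its regular expression are assumptions made for illustration (it treats whitespace followed by a colon as the separator, which suits this sample but not every document); the sample text is abridged from the "preserve" example in this section.

```javascript
// Abridged sample of text extracted with the "preserve" layout.
const pageText = [
  " 19/08/2014",
  " Relaciones comerciales desde : 2008",
  " Crédito : Abierto",
  " Plazo de pago : 90 días",
].join("\n");

// Collect "label : value" lines into a Map; lines without a
// whitespace-then-colon separator (like the date line) are ignored.
function parseLabelValuePairs(text) {
  const pairs = new Map();
  for (const line of text.split("\n")) {
    const match = /^\s*(.+?)\s+:\s*(.*?)\s*$/.exec(line);
    if (match) pairs.set(match[1], match[2]);
  }
  return pairs;
}

const pairs = parseLabelValuePairs(pageText);
for (const [label, value] of pairs) {
  console.log(`${label} => ${value}`);
}
```

Tabular regions like the financial disclosure table would need a different approach, for example splitting on runs of two or more spaces, since their columns are not separated by colons.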
Example operation configuration
{ "op": "text", "layout": "decompose" }
new io.pdfdata.model.ops.Text()
new io.pdfdata.model.ops.Text(Text.Layout.DECOMPOSE)
Example result object
io.pdfdata.model.ops.Text.Result JSON {
"op": "text",
"data": [{
"text": " 19/08/2014\n Relaciones comerciales desde : 2008\n Crédito : Abierto\n Consumo mensual : 300,000.00\n Máximo consumo : 500,000.00\n Plazo de pago : 90 días\n Pago con cheque : NO\n Pago con transferencia : SI\n Pago en efectivo : NO\n Productos que compran : Equipo hotelero y de restaurante\n Cifras expresadas en : Pesos\n Opinión : Muy Bueno\n Atraso en pagos : No tiene\n Requieren garantías : NO",
"type": "page",
"pagenum": 0,
"dimensions": [595, 842]
}]
}
text operation configuration object attributes | ||
---|---|---|
op | Must be "text". | |
layout (optional) | Either "preserve" (the default) or "decompose". |
text operation results consist of a data array containing page objects (in the
Java client, a List of io.pdfdata.model.Page objects), one for each page in the
source PDF document. Each page object has a string text attribute, the text
extracted from that page according to the layout configuration option specified
when the proc was created.
text Page object attributes | ||
---|---|---|
type | The type of this object, always "page". | |
pagenum | The source document's page number, zero-indexed. | |
dimensions | The | |
text | The text extracted from page |
Bitmap images
Example operation configuration
{ "op": "images" }
new io.pdfdata.model.ops.Images()
Many PDF documents include graphics by way of embedding images of the
bitmap/raster variety typically encoded and exchanged via JPEG, PNG, TIFF, and
similar formats. The images operation extracts these bitmaps and their
rendered location from source PDF documents.
images operation configuration object attributes | ||
---|---|---|
op | Must be "images". |
Example images result object
io.pdfdata.model.ops.Images.Result JSON {
"op": "images",
"data": [{
"type": "page",
"dimensions": [595, 839],
"pagenum": 0,
"images": [{
"type": "img",
"bounds": [62.362, 248.541, 72.362, 255.541],
"resource": "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a"
}, {
"type": "img",
"bounds": [62.362, 173.397, 63.362, 174.397],
"resource": "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72"
}]
}],
"resources": {
"rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a": {
"url": "/v1/resources/rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a",
"mimetype": "image/png",
"dimensions": [10.0, 7.0]
},
"rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72": {
"url": "/v1/resources/rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72",
"mimetype": "image/png",
"dimensions": [1.0, 1.0]
}
}
}
The images operation produces result objects that enumerate each page of the
source document where images were found. Each of those page objects contains an
array of objects (one for each image) that provide exact location and size
information for the image and refer to its corresponding resource.
Specifically, the data attribute is a listing of each page of the source
document which contains images. Those page objects enumerate each rendered image
in their images array attribute, and each image object describes the image's
position and extent via bounds, referring to the corresponding resource that
holds the actual bitmap data via a resource attribute.
images Page object attributes | ||
---|---|---|
type | The type of this object, always "page". | |
pagenum | The source document's page number, zero-indexed. | |
dimensions | The | |
images | An array of |
img object attributes | ||
---|---|---|
type | The type of this object, always "img". | |
bounds | A description of a rectangular bounding box's coordinates. The dimensions implied by these bounds may be different than the physical | |
resource | The ID of the resource associated with this object. Further data about the resource — including a |
In addition to the standard mimetype and url attributes carried by all
resource objects, resources produced by the images operation include
a dimensions attribute that describes the resource's bitmap data itself
(e.g. an image with dimensions of [100, 100] might be rendered on different
pages of a source document scaled to different sizes, such as [50, 50] or
[10, 100]).
images operation's additional resource object attributes | ||
---|---|---|
dimensions | The |
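To see how a rendered image relates to its underlying bitmap, an application can compare each image's bounds with the dimensions of the resource it refers to. The sketch below is an illustration under stated assumptions: it treats bounds as [x1, y1, x2, y2] corner coordinates (which is consistent with the example result, but is an assumption here), and renderedScale is a hypothetical helper, not a client library function. The data is taken from the example images result.

```javascript
// Abridged copy of the example images result.
const imagesResult = {
  op: "images",
  data: [{
    type: "page",
    dimensions: [595, 839],
    pagenum: 0,
    images: [
      { type: "img", bounds: [62.362, 248.541, 72.362, 255.541],
        resource: "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a" },
      { type: "img", bounds: [62.362, 173.397, 63.362, 174.397],
        resource: "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72" },
    ],
  }],
  resources: {
    rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a: {
      url: "/v1/resources/rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a",
      mimetype: "image/png", dimensions: [10.0, 7.0] },
    rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72: {
      url: "/v1/resources/rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72",
      mimetype: "image/png", dimensions: [1.0, 1.0] },
  },
};

// Compare an image's rendered extent (from `bounds`, assumed to be
// [x1, y1, x2, y2]) with the bitmap dimensions of its resource.
function renderedScale(img, resources) {
  const [x1, y1, x2, y2] = img.bounds;
  const [bitmapWidth, bitmapHeight] = resources[img.resource].dimensions;
  const width = x2 - x1;
  const height = y2 - y1;
  return { width, height, scaleX: width / bitmapWidth, scaleY: height / bitmapHeight };
}

// One entry per rendered image, across all pages that contain images.
const scales = imagesResult.data.flatMap((page) =>
  page.images.map((img) => ({
    pagenum: page.pagenum,
    resource: img.resource,
    ...renderedScale(img, imagesResult.resources),
  }))
);

console.log(scales);
```

Subtracting floating-point coordinates can leave tiny rounding errors, so compare the resulting sizes and scales with a tolerance rather than strict equality.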
Please refer to this guide's section on resources for a treatment
of how the resources attribute in operation results is organized, and how you
can retrieve resources' data via the PDFDATA.io API.
Document attachments
Example operation configuration
{ "op": "attachments" }
new io.pdfdata.model.ops.Attachments()
Example result object
io.pdfdata.model.ops.Attachments.Result JSON {
"op": "attachments",
"data": [{
"title": "Attachment1",
"bounds": [100.0, 722.0, 120.0, 742.0],
"pagenum": 0,
"location": "..//..//two_pilots.bmp",
"resource": "rsrc_5b9477c374e62a348f4423956684b028fe9acb95"
}, {
"title": "Attachment2",
"bounds": [100.0, 622.0, 120.0, 642.0],
"pagenum": 0,
"location": "..//..//License.rtf",
"resource": "rsrc_13ffe09daa1dd4a67b2ed8cec6e7a5fcc8c5156b"
}],
"resources": {
"rsrc_13ffe09daa1dd4a67b2ed8cec6e7a5fcc8c5156b": {
"url": "/v1/resources/rsrc_13ffe09daa1dd4a67b2ed8cec6e7a5fcc8c5156b",
"mimetype": "application/octet-stream"
},
"rsrc_5b9477c374e62a348f4423956684b028fe9acb95": {
"url": "/v1/resources/rsrc_5b9477c374e62a348f4423956684b028fe9acb95",
"mimetype": "application/octet-stream"
}
}
}
Attachment resources
PDF document attachments can be anything, just like email attachments.
Just like emails, PDF documents can contain attachments, separate files that
are not displayed as part of the PDF document. The attachments operation
extracts all attachments found within a PDF document and provides them as
resources, along with whatever metadata is available for each attachment.
attachments operation configuration object attributes | ||
---|---|---|
op | Must be "attachments". |
In addition to the raw attachment data linked to in the resources object in
each result, the data attribute's array contains an object providing metadata
for each attachment.
Attachment object attributes | ||
---|---|---|
description (optional) | A description of the attachment or its contents, for human consumption. | |
location (optional) | Typically a filesystem path or URL, indicating from where the attachment was sourced. | |
resource (optional) | The ID of the resource associated with this object. Further data about the resource, including a | |
title (optional) | A further description of the attachment or its contents. Only present if an attachment is pinned to a particular location within the source document. | |
pagenum (optional) | The source document's page number, zero-indexed. Only present if an attachment is pinned to a particular location within the source document. | |
bounds (optional) | A description of a rectangular bounding box's coordinates. Only present if an attachment is pinned to a particular location within the source document. |
PDF document attachments can optionally be "pinned" to a location within the
source document, usually represented as a "push-pin" annotation on a particular
page. In this case, an attachment's data object will include pagenum, bounds,
and title attributes that describe the location and presentation of its
associated annotation.
Note that PDF document attachments do not explicitly indicate their content type
or mimetype, so all attachment resources will have a mimetype of
"application/octet-stream". The location attribute of each attachment's
data object will usually include a filename (sometimes along with other path
information from when the file was attached to the PDF) that will include a file
extension (like .bmp or .rtf as in the example result to the right) that you
can use to infer the attachment's file or content type.
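Inferring a content type from location might look like the sketch below. The EXTENSION_TYPES table and the inferAttachmentType helper are hypothetical and would need extending for the file types an application expects; the sketch also ignores URL query strings, and falls back to "application/octet-stream" whenever no known extension is found.

```javascript
// Hypothetical extension-to-MIME-type table; extend it as needed.
const EXTENSION_TYPES = new Map([
  ["bmp", "image/bmp"],
  ["rtf", "application/rtf"],
  ["pdf", "application/pdf"],
]);

const FALLBACK_TYPE = "application/octet-stream";

// Infer a content type from the file extension in an attachment's
// `location`, tolerating both "/" and "\" path separators.
function inferAttachmentType(attachment) {
  const location = attachment.location || "";
  const filename = location.split(/[\\/]/).pop();
  const dot = filename.lastIndexOf(".");
  if (dot <= 0) return FALLBACK_TYPE; // no extension, or a dotfile
  const extension = filename.slice(dot + 1).toLowerCase();
  return EXTENSION_TYPES.get(extension) || FALLBACK_TYPE;
}

// Attachment objects from the example result (abridged).
const attachments = [
  { title: "Attachment1", location: "..//..//two_pilots.bmp" },
  { title: "Attachment2", location: "..//..//License.rtf" },
  { title: "No location" },
];

const inferred = attachments.map(inferAttachmentType);
console.log(inferred);
```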
It is possible (though very rare) for a PDF document attachment to not
actually carry attachment data. In this case, the attachment location will
typically be a URL referring to an "attachment" located elsewhere, and the
attachment object will lack a resource attribute.
If no attachments are found in a source document, then the resources and
data attributes of the operation result will be empty.
Resources
Operation result object linking to extracted resource
io.pdfdata.model.ops.XMPMetadata.Result JSON {
"op": "xmp-metadata",
"resources": {
"rsrc_4218185ed0b3736f47ec787bb4142189cbcb057b": {
"url": "/v1/resources/rsrc_4218185ed0b3736f47ec787bb4142189cbcb057b",
"mimetype": "application/xml"
}
}
}
Operation result object including references to resources from data entities
io.pdfdata.model.ops.Images.Result JSON {
"op": "images",
"data": [{
"type": "page",
"dimensions": [595, 839],
"pagenum": 0,
"images": [{
"type": "img",
"bounds": [62.362, 248.541, 72.362, 255.541],
"resource": "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a"
}, {
"type": "img",
"bounds": [62.362, 173.397, 63.362, 174.397],
"resource": "rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72"
}]
}],
"resources": {
"rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a": {
"url": "/v1/resources/rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a",
"mimetype": "image/png",
"dimensions": [10.0, 7.0]
},
"rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72": {
"url": "/v1/resources/rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72",
"mimetype": "image/png",
"dimensions": [1.0, 1.0]
}
}
}
Some operations extract data which cannot be represented as a
structured JSON value, e.g. bitmap images, attachments, and so on. Within the
PDFDATA.io API, these sorts of data entities are called resources. If an
operation produces resources, the results it delivers via completed proc
objects will include a resources attribute, a JSON object that links to
and provides metadata for all of the resources extracted by the operation. Each
resource's binary data is then obtained via a separate API request to the
provided URL. In the Java client, such results carry a collection of
io.pdfdata.model.Resource objects, each providing an InputStream for its
binary data via a .get() method.
Base resource object attributes | ||
---|---|---|
url | The URL where this resource's data may be retrieved via an HTTP GET request. Note that this will generally be a relative URL, and so must be resolved against your base PDFDATA.io API URL to retrieve the resource's data. | |
mimetype | The MIME type of this resource. It will ideally provide a useful and accurate indication of the resource's data's format, but depending on the operation and the particulars of how a source PDF document encodes the resource, it is possible for this to be |
Operations may further "enrich" the base set of resource object attributes of
url and mimetype with additional metadata that pertains strictly to the
resources' data (again, as in the case of images results, which provide
resources that include an additional dimensions attribute). In the Java
client, such extended types of resources are manifested as operation-specific
subclasses of io.pdfdata.model.Resource.
Some operations' results will only include resources (like xmp-metadata);
others will include further information about the use of those resources in
the source PDF documents via the data attribute (like images). In these cases,
objects referring to resources will do so by providing the resource's ID via a
resource attribute, which can then be used to perform a lookup on the
resources object from the same operation result. In the Java client, result
data objects referring to resources will be a subclass of
io.pdfdata.model.ResourcefulEntity, which provides a .getResource() method
that simplifies the lookup of the associated resource.
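When working with the raw JSON, the lookup described above amounts to indexing the resources object by the ID found in a data entity, then resolving the relative url. A minimal sketch, in which resolveResource is a hypothetical helper and the API root is the one given in this guide; the data is abridged from the example images result:

```javascript
const API_ROOT = "https://api.pdfdata.io";

// Abridged copy of the example images result.
const opResult = {
  op: "images",
  data: [{
    type: "page",
    pagenum: 0,
    images: [{
      type: "img",
      bounds: [62.362, 248.541, 72.362, 255.541],
      resource: "rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a",
    }],
  }],
  resources: {
    rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a: {
      url: "/v1/resources/rsrc_07a70ad3fca78c161846d0931058b6582c2ed94a",
      mimetype: "image/png",
      dimensions: [10.0, 7.0],
    },
  },
};

// Look up the resource an entity refers to and resolve its relative
// `url` against the API root. Returns null for entities that carry no
// (or an unknown) resource ID.
function resolveResource(entity, result) {
  const resource = result.resources[entity.resource];
  if (!resource) return null;
  return {
    id: entity.resource,
    ...resource,
    absoluteUrl: new URL(resource.url, API_ROOT).href,
  };
}

const resolved = resolveResource(opResult.data[0].images[0], opResult);
console.log(resolved.absoluteUrl);
```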
Retrieving resources
Definition
GET https://api.pdfdata.io/{RESOURCE_URL}
pdfdata.resources.byID("{RESOURCE_ID}");
pdfdata.resources.byURL("{RESOURCE_URL}");
pdfdata.resources().byID(String resourceID);
pdfdata.resources().byURL(String resourceURL);
Example request
curl https://api.pdfdata.io/v1/resources/rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72 \
  -u test_xkH4xlrO7K80J5CTwEjBeSo6:
var pdfdata = require("pdfdata")("test_xkH4xlrO7K80J5CTwEjBeSo6");
pdfdata.resources.byURL("/v1/resources/rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72")
.then(function (resourceMessage) {
// handle http.IncomingMessage
})
.catch(function (error) {
// handle failure
});
pdfdata.resources.byID("rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72")
.then(function (resourceMessage) {
// handle http.IncomingMessage
})
.catch(function (error) {
// handle failure
});
InputStream resourceData =
  pdfdata.resources().byID("rsrc_9d7b8cffdb355edf6513a435e5465bbf7181ae72");
Example response
< HTTP/1.1 200 OK
< Content-Type: application/xml
< Content-Disposition: attachment; filename="8e9600cd7db5baf2fad83e4d8b48359678b24322-xmp.xml"
< content-length: 3468
<
... resource's data follows ...
When you've identified the resource you'd like to retrieve, send an HTTP GET
request to the url indicated in that resource's object. Those URLs are
relative, meaning that they can be easily resolved against e.g. the URL you
used to create and/or retrieve the completed proc object that included the
extracted data and resources, or against the PDFDATA.io API root directly
(https://api.pdfdata.io).
With the Node.js client, once you've identified the resource you'd like to
retrieve, call either pdfdata.resources.byID() or pdfdata.resources.byURL()
with the resource's ID or url, respectively. The result will be a promise that
will be fulfilled with an http.IncomingMessage object. Its attributes will be
unmodified from those corresponding to the HTTP response issued by the
PDFDATA.io API; its body attribute will be a Buffer containing the resource's
data.
A couple of headers included in resource responses are notable:
- The content-type header will match the mimetype attribute of the
corresponding resource object delivered as part of the extracting operation's
result.
- The content-disposition header will indicate a filename for the resource's
data that is based on a hash of that data.
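If you want to save a resource under the filename suggested by the content-disposition header, the value needs to be parsed out of the header first. The helper below is a hypothetical sketch that only handles the quoted filename="..." form shown in the example response; it does not attempt to cover other header forms.

```javascript
// Extract the quoted filename from a Content-Disposition header value,
// or return null when the header is missing or has no quoted filename.
function filenameFromContentDisposition(headerValue) {
  const match = /filename="([^"]*)"/i.exec(headerValue || "");
  return match ? match[1] : null;
}

// Header value taken from the example response.
const header =
  'attachment; filename="8e9600cd7db5baf2fad83e4d8b48359678b24322-xmp.xml"';

const filename = filenameFromContentDisposition(header);
console.log(filename);
```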
As discussed in the previous section, in the Java client
io.pdfdata.model.Resource objects always provide an easy .get() method for
retrieving their binary data contents. However, if you only have access to a
resource's ID or URL for some reason, you can obtain its contents via the
methods provided by pdfdata.resources().