5a. Data Point Extraction
This section covers jobs of type Data Point Extraction. The previous section covered the Job resource at a high level.
Data Point Extraction Job Details
Data point extraction jobs take a PDF or ePub document as input and produce an XML document in Innodom format as output. A taxonomy, in JSON format, must also be specified.
Creating a Data Point Extraction Job
To create a data point extraction job, send POST /jobs with type set to data-point-extraction.
The URI of the input document must be sent with the input role in job.contents.
Job Metadata
The following job metadata fields are used for data point extraction jobs.
- mapping.high_confidence_threshold (optional): A confidence threshold between 0 and 1 inclusive. A value of 0 means all data points will pass through without human review. A value of 1 (or higher) means all data points will require human review. A value strictly between 0 and 1 means some data points may require human review, depending on the confidence of the system. Default value is 1.
- mapping.taxonomy (required): The name of the taxonomy to be used for this job.
- text-extraction.ocr (optional): A boolean indicating whether the system should attempt to perform OCR for text extraction. Default value is false.
- zoning.high_confidence_threshold (optional): A confidence threshold between 0 and 1 inclusive. A value of 0 means all zoned pages will pass through without human review. A value of 1 (or higher) means all zoned pages will require human review. A value strictly between 0 and 1 means some zones may require human review, depending on the confidence of the system. Default value is 1.
- zoning.high_resolution_image_height (optional): A number indicating the desired height (in pixels) of the high resolution scans of the input document. Default value is 1024.
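One way to read the two confidence thresholds is as a cutoff on the system's confidence score. The sketch below is an interpretation, not the service's documented logic: the function name requires_human_review and the strict less-than comparison for in-between values are assumptions, chosen only because they agree with the documented endpoints (0 sends nothing to review, 1 or higher sends everything).

```python
def requires_human_review(confidence: float, threshold: float = 1.0) -> bool:
    """Return True if an item with this confidence would be routed to review.

    Hypothetical helper: the service does not document the exact comparison.
    The default threshold of 1 matches the documented default.
    """
    if threshold <= 0:
        return False  # documented: 0 lets everything through without review
    if threshold >= 1:
        return True  # documented: 1 (or higher) sends everything to review
    # Assumed rule for values in between: review anything below the cutoff.
    return confidence < threshold


if __name__ == "__main__":
    for confidence in (0.2, 0.8, 0.95):
        print(confidence, requires_human_review(confidence, threshold=0.9))
```

Under this reading, lowering a threshold trades review effort for a higher risk of unreviewed errors.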
Example
Below is an example of creating a new job of type data-point-extraction.
POST https://api.innodata.com/v1.1/jobs
Content-type: application/json
Authorization: Basic dXNlci1saXZlLTYzMmE1YTYzLWQ2ZDYtNDI0Ni05MWNhLWQ1NDY2MzI2OThkMzo=
Body:
{
  "collaboration": {
    "teams": [
      {
        "id": "c383d5a5-4cff-473f-b820-b53bb70abb78",
        "steps": ["*"]
      }
    ]
  },
  "input_content": {
    "role": "input",
    "uri": "https://api.innodata.com/v1.1/documents/f7afca0f-cd88-465a-bdac-421f7ada07fe/contents"
  },
  "metadata": {
    "mapping": {
      "taxonomy": "my-project-taxonomy.json"
    }
  },
  "type": "data-point-extraction"
}
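The same request body can be assembled programmatically. The following is a minimal sketch that mirrors the field names in the example above; the function name build_extraction_job is hypothetical, and the sketch only builds and prints the JSON body rather than sending it.

```python
import json


def build_extraction_job(document_uri: str, taxonomy: str, team_id: str) -> dict:
    """Build a request body for POST /jobs, mirroring the example above.

    Hypothetical helper: only the fields shown in the example are included.
    """
    return {
        "collaboration": {"teams": [{"id": team_id, "steps": ["*"]}]},
        "input_content": {"role": "input", "uri": document_uri},
        "metadata": {"mapping": {"taxonomy": taxonomy}},
        "type": "data-point-extraction",
    }


if __name__ == "__main__":
    body = build_extraction_job(
        "https://api.innodata.com/v1.1/documents/f7afca0f-cd88-465a-bdac-421f7ada07fe/contents",
        "my-project-taxonomy.json",
        "c383d5a5-4cff-473f-b820-b53bb70abb78",
    )
    # Serialize with the standard library so the output is always valid JSON.
    print(json.dumps(body, indent=2))
```

Building the body as a dictionary and serializing it with json.dumps avoids hand-written JSON mistakes such as trailing commas.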
Mapping Taxonomy
The taxonomy determines the list of data points, their types, and more. Each job is expected to have a single taxonomy that does not change throughout the lifetime of the job.
The documentation for designing a mapping taxonomy can be found here.
Automated Review
Data point extraction jobs support an automated review process. To request an automated review, specify the following job metadata fields when creating the job:
- zoning.qa.teams: a list of JSON objects with two fields. The from field should contain the team ID of the users who did the work. The to field should contain the team ID of the reviewers. This field will force a review of the zoning work before the job progresses to mapping.
- mapping.qa.teams: a list of JSON objects with two fields. The from field should contain the team ID of the users who did the work. The to field should contain the team ID of the reviewers. This field will force a review of the mapping work before the job progresses to completed.
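The two review fields can be sketched as nested metadata. This example assumes that dotted field names nest the same way mapping.taxonomy does in the job creation example; the helper name qa_metadata and the team IDs are placeholders, not values from the service.

```python
import json


def qa_metadata(zoning_pairs, mapping_pairs):
    """Build metadata containing zoning.qa.teams and mapping.qa.teams.

    Each pair is (from_team_id, to_team_id): the team that did the work,
    then the team that reviews it. Nesting is an assumption (see lead-in).
    """
    def teams(pairs):
        return [{"from": worker, "to": reviewer} for worker, reviewer in pairs]

    return {
        "zoning": {"qa": {"teams": teams(zoning_pairs)}},
        "mapping": {"qa": {"teams": teams(mapping_pairs)}},
    }


if __name__ == "__main__":
    # Placeholder team IDs for illustration only.
    metadata = qa_metadata(
        zoning_pairs=[("zoning-team-id", "zoning-review-team-id")],
        mapping_pairs=[("mapping-team-id", "mapping-review-team-id")],
    )
    print(json.dumps(metadata, indent=2))
```

In a real request, the required mapping.taxonomy field would need to be merged into the same mapping object.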
Extracting the Output
Once the job reaches the completed status, the final output will be placed in the output role of the job contents.
This content will have a uri that can be used to retrieve the final output.
The output is an XML document in Innodom format.
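Locating the output URI in a retrieved job might look like the sketch below. The shape of the job resource is an assumption here: a status field, and a contents list whose entries each carry a role and a uri. The function name find_output_uri and the sample URI are hypothetical.

```python
from typing import Optional


def find_output_uri(job: dict) -> Optional[str]:
    """Return the URI of the output content, or None if it is not available.

    Assumed job shape: {"status": ..., "contents": [{"role": ..., "uri": ...}]}.
    """
    if job.get("status") != "completed":
        return None  # output is only placed once the job is completed
    for content in job.get("contents", []):
        if content.get("role") == "output":
            return content.get("uri")
    return None


if __name__ == "__main__":
    # Sample job with a placeholder URI, for illustration only.
    sample_job = {
        "status": "completed",
        "contents": [
            {"role": "input", "uri": "https://example.invalid/input"},
            {"role": "output", "uri": "https://example.invalid/output"},
        ],
    }
    print(find_output_uri(sample_job))
```

Returning None for jobs that are not yet completed lets a caller poll the job and retry later without special-case error handling.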