5a. Data Point Extraction
This section covers jobs of type Data Point Extraction. The previous section covered the Job resource at a high level.
Data Point Extraction Job Details
Data point extraction jobs take a PDF or ePub document as input and produce an XML document in Innodom format as output. A taxonomy, in JSON format, must also be specified.
Creating a Data Point Extraction Job
To create a data point extraction job, send POST /jobs with type set to data-point-extraction.

The URI of the input document must be sent with the role input in job.contents.
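As a rough illustration, the request could be assembled in Python as sketched below. The body shape (input_content, metadata.mapping.taxonomy, type) is copied from the Example later in this section, and the optional collaboration block is left out. The function name build_job_request is hypothetical, and treating the API key as a Basic-auth username with an empty password is an assumption based on what the example header appears to encode. The request is only built here, not sent.

```python
import base64
import json
import urllib.request

# Base URL taken from the example request in this section.
API_BASE = "https://api.innodata.com/v1.1"


def build_job_request(api_key: str, document_uri: str, taxonomy: str) -> urllib.request.Request:
    """Build (but do not send) a POST /jobs request for a data-point-extraction job.

    Hypothetical helper: the body mirrors the documented example request.
    """
    body = {
        "input_content": {"role": "input", "uri": document_uri},
        "metadata": {"mapping": {"taxonomy": taxonomy}},
        "type": "data-point-extraction",
    }
    # Assumption: the API key is the Basic-auth username and the password is empty.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        f"{API_BASE}/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would submit the job; the shape of the response is not covered in this section.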
          
Job Metadata
The following job metadata fields are used for data point extraction jobs.
- mapping.high_confidence_threshold (optional): A confidence threshold between 0 and 1 inclusive. A value of 0 means all data points will pass through without human review. A value of 1 (or higher) means all data points will require human review. A value between 0 and 1 indicates some values may require human review, determined by the confidence of the system. Default value is 1.
- mapping.taxonomy (required): The name of the taxonomy to be used for this job.
- text-extraction.ocr (optional): A boolean indicating whether the system should attempt to perform OCR for text extraction. Default value is false.
- zoning.high_confidence_threshold (optional): A confidence threshold between 0 and 1 inclusive. A value of 0 means all zoned pages will pass through without human review. A value of 1 (or higher) means all zoned pages will require human review. A value between 0 and 1 indicates some zones may require human review, determined by the confidence of the system. Default value is 1.
- zoning.high_resolution_image_height (optional): A number indicating the desired height (in pixels) of the high resolution scans of the input document. Default value is 1024.
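The fields above can be collected into a metadata object along the following lines. This is a minimal sketch: it assumes each dotted field name maps to nested JSON objects, which is how mapping.taxonomy appears in the example request in this section but is not confirmed for the other fields. The helper names nest and build_metadata are hypothetical, and rejecting thresholds outside 0 to 1 is a client-side choice rather than documented API behavior.

```python
def nest(dotted: dict) -> dict:
    """Expand dotted keys, e.g. {"a.b": 1} becomes {"a": {"b": 1}}."""
    nested: dict = {}
    for key, value in dotted.items():
        node = nested
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


def build_metadata(
    taxonomy: str,
    mapping_threshold: float = 1.0,
    zoning_threshold: float = 1.0,
    ocr: bool = False,
    image_height: int = 1024,
) -> dict:
    """Build job metadata using the documented defaults (hypothetical helper)."""
    for name, value in (("mapping", mapping_threshold), ("zoning", zoning_threshold)):
        # Client-side check only; the documented range is 0 to 1 inclusive.
        if not 0 <= value <= 1:
            raise ValueError(f"{name} threshold must be between 0 and 1, got {value}")
    return nest({
        "mapping.taxonomy": taxonomy,
        "mapping.high_confidence_threshold": mapping_threshold,
        "text-extraction.ocr": ocr,
        "zoning.high_confidence_threshold": zoning_threshold,
        "zoning.high_resolution_image_height": image_height,
    })
```

Only mapping.taxonomy is required, so a caller that is happy with the defaults could send just that field instead of the full object.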
Example
Below is an example of creating a new job of type data-point-extraction.
          
POST https://api.innodata.com/v1.1/jobs
Content-type: application/json
Authorization: Basic dXNlci1saXZlLTYzMmE1YTYzLWQ2ZDYtNDI0Ni05MWNhLWQ1NDY2MzI2OThkMzo=
Body:
{
  "collaboration": {
    "teams": [
      {
        "id": "c383d5a5-4cff-473f-b820-b53bb70abb78",
        "steps": ["*"]
      }
    ]
  },
  "input_content": {
    "role": "input",
    "uri": "https://api.innodata.com/v1.1/documents/f7afca0f-cd88-465a-bdac-421f7ada07fe/contents"
  },
  "metadata": {
    "mapping": {
      "taxonomy": "my-project-taxonomy.json"
    }
  },
  "type": "data-point-extraction"
}
Mapping Taxonomy
The taxonomy determines the list of data points, their types, and more. Each job is expected to have a single, unchanging taxonomy throughout its lifetime.
The documentation for designing a mapping taxonomy can be found here.
Automated Review
Data point extraction jobs support an automated review process. To request an automated review, specify the following job metadata fields when creating the job:
- zoning.qa.teams: a list of JSON objects with two fields. The from field should contain the team ID of the users who did the work. The to field should contain the team ID of the reviewers. This field will force a review of the zoning work before the job progresses to mapping.
- mapping.qa.teams: a list of JSON objects with two fields. The from field should contain the team ID of the users who did the work. The to field should contain the team ID of the reviewers. This field will force a review of the mapping work before the job progresses to completed.
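The two review fields might be built as in the sketch below. It assumes the dotted names expand to nested objects in the job metadata, in the same way mapping.taxonomy does in the example request earlier in this section. The function name review_metadata and the team IDs in the usage note are placeholders.

```python
def review_metadata(worker_team: str, reviewer_team: str, stages=("zoning", "mapping")) -> dict:
    """Return metadata that routes work from one team to a reviewing team.

    Hypothetical helper. Each stage receives a qa.teams list holding a single
    from/to pair; further pairs could be appended for additional teams.
    """
    return {
        stage: {"qa": {"teams": [{"from": worker_team, "to": reviewer_team}]}}
        for stage in stages
    }
```

For instance, review_metadata("workers-team-id", "reviewers-team-id", stages=("zoning",)) would request a review of the zoning work only. The result still has to be merged with the rest of the job metadata, such as mapping.taxonomy, before the job is created.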
Extracting the Output
Once the job reaches the completed status, the final output will be placed in the role output of the job contents.

This content will have a uri that can be used to retrieve the final output. The output is an XML document in Innodom format.
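A client might locate the output along these lines. The sketch assumes the job is returned as JSON with a status field and a contents list whose entries each carry a role and a uri; those field names are inferred from the wording of this section and are not a confirmed response schema. The function name output_uri is hypothetical.

```python
from typing import Optional


def output_uri(job: dict) -> Optional[str]:
    """Return the URI of the output content, or None if it is not available yet.

    Assumed shape: {"status": ..., "contents": [{"role": ..., "uri": ...}, ...]}.
    """
    if job.get("status") != "completed":
        return None
    for content in job.get("contents", []):
        if content.get("role") == "output":
            return content.get("uri")
    return None
```

Fetching the returned URI with the same credentials used to create the job should yield the Innodom XML document.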