5a. Data Point Extraction


This section covers jobs of type Data Point Extraction. The previous section covered the Job resource at a high level.

Data Point Extraction Job Details

Data point extraction jobs take a PDF or ePub document as input and produce an XML document in Innodom format as output. A taxonomy, in JSON format, must also be specified.

Creating a Data Point Extraction Job

To create a data point extraction job, send POST /jobs with type set to data-point-extraction.

The URI of the input document must be sent with the input role in job.contents.

Job Metadata

The following job metadata fields are used for data point extraction jobs.

  • mapping.high_confidence_threshold — (optional) A confidence threshold between 0 and 1 inclusive. A value of 0 means all data points will pass through without human review. A value of 1 (or higher) means all data points will require human review. A value between 0 and 1 indicates some values may require human review, determined by the confidence of the system. Default value is 1.
  • mapping.taxonomy — (required) The name of the taxonomy to be used for this job.
  • text-extraction.ocr — (optional) A boolean indicating whether the system should attempt to perform OCR for text extraction. Default value is false.
  • zoning.high_confidence_threshold — (optional) A confidence threshold between 0 and 1 inclusive. A value of 0 means all zoned pages will pass through without human review. A value of 1 (or higher) means all zoned pages will require human review. A value between 0 and 1 indicates some zones may require human review, determined by the confidence of the system. Default value is 1.
  • zoning.high_resolution_image_height — (optional) A number indicating the desired height (in pixels) of the high resolution scans of the input document. Default value is 1024.
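
The fields above can be assembled into a metadata object as in the following minimal sketch. It assumes that each dotted field name maps to a nested JSON object, as the example request in this section suggests for mapping.taxonomy. The helper nest_metadata is hypothetical and not part of the API, and the threshold values are purely illustrative.

```python
def nest_metadata(flat):
    """Expand dotted field names into nested dicts (assumed mapping, see lead-in)."""
    nested = {}
    for dotted_key, value in flat.items():
        node = nested
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            # Create intermediate objects such as "mapping" or "zoning" on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


metadata = nest_metadata({
    "mapping.taxonomy": "my-project-taxonomy.json",  # required
    "mapping.high_confidence_threshold": 0.8,        # illustrative value
    "text-extraction.ocr": True,
    "zoning.high_confidence_threshold": 0.9,         # illustrative value
    "zoning.high_resolution_image_height": 1024,     # documented default
})
```

The resulting dict would be sent as the metadata member of the job body.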

Example

Below is an example of creating a new job of type data-point-extraction.

POST https://api.innodata.com/v1.1/jobs
Content-type: application/json
Authorization: Basic dXNlci1saXZlLTYzMmE1YTYzLWQ2ZDYtNDI0Ni05MWNhLWQ1NDY2MzI2OThkMzo=
Body:
{
  "collaboration": {
    "teams": [
      {
        "id": "c383d5a5-4cff-473f-b820-b53bb70abb78",
        "steps": ["*"]
      }
    ]
  },
  "input_content": {
    "role": "input",
    "uri": "https://api.innodata.com/v1.1/documents/f7afca0f-cd88-465a-bdac-421f7ada07fe/contents"
  },
  "metadata": {
    "mapping": {
      "taxonomy": "my-project-taxonomy.json"
    }
  },
  "type": "data-point-extraction"
}
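
The same request can be built from Python with the standard library, as in the sketch below. It assumes the Basic credential is the API key used as the username with an empty password, which is what the example header appears to encode. The function name build_create_job_request and the api_key value are placeholders, not part of the API.

```python
import base64
import json
import urllib.request

API_BASE = "https://api.innodata.com/v1.1"  # base URL taken from the example above


def build_create_job_request(api_key, team_id, document_uri, taxonomy):
    """Build (but do not send) the POST /jobs request for a data-point-extraction job."""
    body = {
        "collaboration": {"teams": [{"id": team_id, "steps": ["*"]}]},
        "input_content": {"role": "input", "uri": document_uri},
        "metadata": {"mapping": {"taxonomy": taxonomy}},
        "type": "data-point-extraction",
    }
    # Assumption: API key as username, empty password.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        f"{API_BASE}/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_create_job_request(
    api_key="user-live-EXAMPLE",  # placeholder, not a real key
    team_id="c383d5a5-4cff-473f-b820-b53bb70abb78",
    document_uri=f"{API_BASE}/documents/f7afca0f-cd88-465a-bdac-421f7ada07fe/contents",
    taxonomy="my-project-taxonomy.json",
)
# Sending is left to the caller, for example: urllib.request.urlopen(request)
```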

Mapping Taxonomy

The taxonomy determines the list of data points, their types, and more. Each job is expected to have a single taxonomy that does not change throughout the lifetime of the job.

The documentation for designing a mapping taxonomy can be found here.

Automated Review

Data point extraction jobs support an automated review process. To request an automated review, specify the following job metadata fields when creating the job:

  • zoning.qa.teams — a list of JSON objects with two fields. The from field should contain the team ID of the users who did the work. The to field should contain the team ID of the reviewers. This field will force a review of the zoning work before the job progresses to mapping.
  • mapping.qa.teams — a list of JSON objects with two fields. The from field should contain the team ID of the users who did the work. The to field should contain the team ID of the reviewers. This field will force a review of the mapping work before the job progresses to completed.
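
A metadata object that requests review of both stages might look like the sketch below. It assumes the dotted field names nest as JSON objects (zoning.qa.teams becomes zoning, then qa, then teams). The team IDs are placeholders and qa_route is a hypothetical helper.

```python
import json

# Placeholder team IDs; substitute the real IDs of the working and reviewing teams.
WORK_TEAM = "work-team-id"
REVIEW_TEAM = "review-team-id"


def qa_route(from_team, to_team):
    """One entry of a qa.teams list: work done by from_team is reviewed by to_team."""
    return {"from": from_team, "to": to_team}


review_metadata = {
    "zoning": {"qa": {"teams": [qa_route(WORK_TEAM, REVIEW_TEAM)]}},
    "mapping": {
        "taxonomy": "my-project-taxonomy.json",
        "qa": {"teams": [qa_route(WORK_TEAM, REVIEW_TEAM)]},
    },
}

print(json.dumps(review_metadata, indent=2))
```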

Extracting the Output

Once the job reaches the completed status, the final output will be placed in the role output of the job contents.

This content will have a uri that can be used to retrieve the final output. The output is an XML document in Innodom format.
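
Locating the output URI in a retrieved job could be sketched as follows. The shape of the job object here is an assumption: a status field and a contents list whose entries carry role and uri. Check it against the Job resource description. The function find_output_uri and the sample URIs are hypothetical.

```python
def find_output_uri(job):
    """Return the URI of the output content, or None if the job has no output yet."""
    if job.get("status") != "completed":
        return None
    for content in job.get("contents", []):
        if content.get("role") == "output":
            return content.get("uri")
    return None


# Sample job representation with placeholder URIs (assumed shape, see lead-in).
job = {
    "status": "completed",
    "type": "data-point-extraction",
    "contents": [
        {"role": "input", "uri": "https://api.innodata.com/v1.1/documents/INPUT-ID/contents"},
        {"role": "output", "uri": "https://api.innodata.com/v1.1/documents/OUTPUT-ID/contents"},
    ],
}

output_uri = find_output_uri(job)
```

The returned URI can then be fetched with the same credentials used to create the job, and the response parsed as XML.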