5b. Keyword Indexing


This section covers jobs of type Keyword Indexing. The previous section covered the Job resource at a high level.

Keyword Indexing Job Details

Keyword indexing jobs start with a PDF or ePub document and produce an XML document as output in Innodom format. A taxonomy must also be specified in JSON format.

Creating a Keyword Indexing Job

To create a keyword indexing job, send POST /jobs with type set to indexing.

The URI of the input document must be sent in job.contents with the role input.

Job Metadata

The following job metadata fields are used for keyword indexing jobs.

  • indexing.high_confidence_threshold — (optional) A confidence threshold between 0 and 1 inclusive. A value of 0 means all keywords will pass through without human review. A value of 1 (or higher) means all keywords will require human review. A value between 0 and 1 indicates some keywords may require human review, determined by the confidence of the system. Default value is 1.
  • indexing.taxonomy — (required) The name of the taxonomy to be used for this job.
  • text-extraction.ocr — (optional) A boolean indicating whether the system should attempt to perform OCR for text extraction. Default value is false.
  • zoning.high_confidence_threshold — (optional) A confidence threshold between 0 and 1 inclusive. A value of 0 means all zoned pages will pass through without human review. A value of 1 (or higher) means all zoned pages will require human review. A value between 0 and 1 indicates some zones may require human review, determined by the confidence of the system. Default value is 1.
  • zoning.high_resolution_image_height — (optional) A number indicating the desired height (in pixels) of the high resolution scans of the input document. Default value is 1024.
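The fields above can be assembled into a metadata object before the job is submitted. The following is a minimal Python sketch, not part of the API itself: the function name build_indexing_metadata and its parameter names are hypothetical, and it assumes that every dotted field name maps to a nested JSON object in the same way that indexing.taxonomy does in the example request. It also rejects thresholds outside 0 to 1 on the client side, which is a deliberate simplification.

```python
def _check_threshold(name, value):
    """Validate a confidence threshold against the documented 0..1 range."""
    if not 0 <= value <= 1:
        raise ValueError(f"{name} must be between 0 and 1 inclusive, got {value!r}")
    return value


def build_indexing_metadata(taxonomy, indexing_threshold=None, ocr=None,
                            zoning_threshold=None, image_height=None):
    """Build the nested metadata object for a keyword indexing job.

    Optional fields that are left as None are omitted so that the
    documented server-side defaults apply.
    """
    if not taxonomy:
        # indexing.taxonomy is the only required field.
        raise ValueError("indexing.taxonomy is required")

    metadata = {"indexing": {"taxonomy": taxonomy}}

    if indexing_threshold is not None:
        metadata["indexing"]["high_confidence_threshold"] = _check_threshold(
            "indexing.high_confidence_threshold", indexing_threshold)

    if ocr is not None:
        # Assumption: "text-extraction.ocr" nests under a "text-extraction" key.
        metadata["text-extraction"] = {"ocr": bool(ocr)}

    zoning = {}
    if zoning_threshold is not None:
        zoning["high_confidence_threshold"] = _check_threshold(
            "zoning.high_confidence_threshold", zoning_threshold)
    if image_height is not None:
        zoning["high_resolution_image_height"] = image_height
    if zoning:
        metadata["zoning"] = zoning

    return metadata
```

Calling build_indexing_metadata("my-project-taxonomy.json") with no other arguments yields the same metadata object used in the example request in this section.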

Example

Below is an example of creating a new job of type indexing.

POST https://api.innodata.com/v1.1/jobs
Content-type: application/json
Authorization: Basic dXNlci1saXZlLTYzMmE1YTYzLWQ2ZDYtNDI0Ni05MWNhLWQ1NDY2MzI2OThkMzo=
Body:
{
  "collaboration": {
    "teams": [
      {
        "id": "c383d5a5-4cff-473f-b820-b53bb70abb78",
        "steps": ["*"]
      }
    ]
  },
  "input_content": {
    "role": "input",
    "uri": "https://api.innodata.com/v1.1/documents/f7afca0f-cd88-465a-bdac-421f7ada07fe/contents"
  },
  "metadata": {
    "indexing": {
      "taxonomy": "my-project-taxonomy.json"
    }
  },
  "type": "indexing"
}
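The same request can be prepared from Python using only the standard library. This is a sketch under stated assumptions rather than an official client: build_create_job_request and its parameters are hypothetical names, and the Basic credential is assumed to be the API key followed by a colon with an empty password, which is what the sample Authorization header appears to encode. The function only builds the request object and does not send it.

```python
import base64
import json
import urllib.request

API_BASE = "https://api.innodata.com/v1.1"


def build_create_job_request(api_key, document_uri, team_id, taxonomy):
    """Build (but do not send) a POST /jobs request for a keyword indexing job."""
    body = {
        "collaboration": {"teams": [{"id": team_id, "steps": ["*"]}]},
        "input_content": {"role": "input", "uri": document_uri},
        "metadata": {"indexing": {"taxonomy": taxonomy}},
        "type": "indexing",
    }
    # Assumption: HTTP Basic auth with the API key as user name and no password.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{API_BASE}/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

To submit the job, pass the returned object to urllib.request.urlopen and parse the response body with json.loads. Serializing the body with json.dumps also guarantees valid JSON, which avoids mistakes such as a trailing comma in a hand-written payload.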

Extracting the Output

Once the job reaches the completed status, the final output is available in the job contents under the role output.

This content will have a uri that can be used to retrieve the final output. The output is an XML document in Innodom format.
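Locating the output URI in a retrieved job can be sketched as follows. The exact shape of the job representation is not shown in this section, so the sketch assumes the parsed job is a dictionary with a status field and a contents list whose entries carry role and uri fields. The function name find_output_uri is hypothetical.

```python
def find_output_uri(job):
    """Return the URI of the output content, or None if it is not available.

    Assumes `job` is the parsed JSON of a job with a "status" string and a
    "contents" list of objects that each have "role" and "uri" keys.
    """
    if job.get("status") != "completed":
        # The output is only placed in the contents once the job completes.
        return None
    for content in job.get("contents", []):
        if content.get("role") == "output":
            return content.get("uri")
    return None
```

A caller would typically poll the job until this function returns a URI, then issue an authenticated GET against that URI to download the Innodom XML document.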