5b. Keyword Indexing
This section covers jobs of type Keyword Indexing. The previous section covered the Job resource at a high level.
Keyword Indexing Job Details
Keyword indexing jobs start with a PDF or ePub document and produce an XML document as output in Innodom format. A taxonomy must also be specified in JSON format.
Creating a Keyword Indexing Job
Create a keyword indexing job by sending POST /jobs with type set to indexing.
The URI of the input document must be sent with the input role in job.contents.
Job Metadata
The following job metadata fields are used for keyword indexing jobs.
- indexing.high_confidence_threshold (optional): A confidence threshold between 0 and 1 inclusive. A value of 0 means all keywords will pass through without human review. A value of 1 (or higher) means all keywords will require human review. A value between 0 and 1 indicates some keywords may require human review, determined by the confidence of the system. Default value is 1.
- indexing.taxonomy (required): The name of the taxonomy to be used for this job.
- text-extraction.ocr (optional): A boolean indicating whether the system should attempt to perform OCR for text extraction. Default value is false.
- zoning.high_confidence_threshold (optional): A confidence threshold between 0 and 1 inclusive. A value of 0 means all zoned pages will pass through without human review. A value of 1 (or higher) means all zoned pages will require human review. A value between 0 and 1 indicates some zones may require human review, determined by the confidence of the system. Default value is 1.
- zoning.high_resolution_image_height (optional): A number indicating the desired height (in pixels) of the high resolution scans of the input document. Default value is 1024.
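The dotted field names above can be expanded into the nested metadata object a request body carries. The sketch below is a minimal illustration, assuming every dotted name nests the same way indexing.taxonomy does in the example request (a "taxonomy" key inside an "indexing" object). The helper name build_job_metadata and the client-side checks are hypothetical and not part of the API.

```python
from typing import Any, Dict


def build_job_metadata(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand dotted metadata names into nested objects (hypothetical helper)."""
    # indexing.taxonomy is the only field documented as required.
    if "indexing.taxonomy" not in flat:
        raise ValueError("indexing.taxonomy is required")

    nested: Dict[str, Any] = {}
    for dotted_name, value in flat.items():
        # Assumption: negative thresholds are never meaningful. Values above 1
        # are left alone because the text says they behave like 1.
        if dotted_name.endswith(".high_confidence_threshold") and value < 0:
            raise ValueError(f"{dotted_name} must not be negative")

        # Walk or create the parent objects, then set the leaf value.
        *parents, leaf = dotted_name.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


metadata = build_job_metadata({
    "indexing.taxonomy": "my-project-taxonomy.json",
    "indexing.high_confidence_threshold": 0.8,
    "text-extraction.ocr": True,
})
```

Omitted optional fields are simply left out, so the service defaults described above would apply.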
Example
Below is an example of creating a new job of type indexing.
POST https://api.innodata.com/v1.1/jobs
Content-type: application/json
Authorization: Basic dXNlci1saXZlLTYzMmE1YTYzLWQ2ZDYtNDI0Ni05MWNhLWQ1NDY2MzI2OThkMzo=
Body:
{
  "collaboration": {
    "teams": [
      {
        "id": "c383d5a5-4cff-473f-b820-b53bb70abb78",
        "steps": ["*"]
      }
    ]
  },
  "input_content": {
    "role": "input",
    "uri": "https://api.innodata.com/v1.1/documents/f7afca0f-cd88-465a-bdac-421f7ada07fe/contents"
  },
  "metadata": {
    "indexing": {
      "taxonomy": "my-project-taxonomy.json"
    }
  },
  "type": "indexing"
}
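The same request could be prepared from Python roughly as follows. This is a sketch using only the standard library, not an official client: the function name build_indexing_job_request is hypothetical, the body mirrors the example above, and it assumes the API key is sent as the Basic auth user name with an empty password, which is what the example Authorization header appears to encode. The request is built but not sent.

```python
import base64
import json
import urllib.request

API_BASE = "https://api.innodata.com/v1.1"


def build_indexing_job_request(api_key, team_id, document_uri, taxonomy):
    """Prepare (but do not send) a POST /jobs request for an indexing job."""
    body = {
        "collaboration": {"teams": [{"id": team_id, "steps": ["*"]}]},
        "input_content": {"role": "input", "uri": document_uri},
        "metadata": {"indexing": {"taxonomy": taxonomy}},
        "type": "indexing",
    }
    # Assumption: API key as user name, empty password ("key:").
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        f"{API_BASE}/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_indexing_job_request(
    api_key="YOUR-API-KEY",  # placeholder, not a real key
    team_id="c383d5a5-4cff-473f-b820-b53bb70abb78",
    document_uri=f"{API_BASE}/documents/f7afca0f-cd88-465a-bdac-421f7ada07fe/contents",
    taxonomy="my-project-taxonomy.json",
)
# To submit it: urllib.request.urlopen(request)
```

Serializing the body with json.dumps rules out syntax slips such as trailing commas, which strict JSON parsers reject.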
Extracting the Output
Once the job reaches the completed status, the final output will be placed in the output role of the job contents.
This content will have a uri that can be used to retrieve the final output.
The output is an XML document in Innodom format.
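Locating the output URI in a job representation might look like the sketch below. The exact shape of the Job resource is not shown in this section, so the field names are assumptions: a top-level "status" string and a "contents" list whose entries each carry "role" and "uri". The helper find_output_uri and the sample job are hypothetical.

```python
from typing import Any, Dict, Optional


def find_output_uri(job: Dict[str, Any]) -> Optional[str]:
    """Return the URI of the output content, or None if it is not available."""
    # The output is only placed in the contents once the job has completed.
    if job.get("status") != "completed":
        return None
    # Assumed layout: job["contents"] is a list of {"role": ..., "uri": ...}.
    for content in job.get("contents", []):
        if content.get("role") == "output":
            return content.get("uri")
    return None


# Illustrative job data only; the URIs are placeholders.
sample_job = {
    "status": "completed",
    "contents": [
        {"role": "input", "uri": "https://example.invalid/input"},
        {"role": "output", "uri": "https://example.invalid/output"},
    ],
}
output_uri = find_output_uri(sample_job)
```

The returned URI can then be fetched with the same credentials used to create the job to download the Innodom XML.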