CorrectOCR.server module

Below are some examples for a possible frontend. Naturally, they are only suggestions and any workflow and interface may be used.

Example Workflow

Open the image in a new window to view at size.

Example User Interface

@startsalt
{
  {+
    left token | TOKEN IMAGE | right token
    .
    . | ^Suggestions^ | .
    .
    [Hyphenate left] | [Accept] | [Hyphenate right]
  }
}
@endsalt

The combo box would then contain the k-best suggestions from the backend, allowing the user to accept the desired one or enter their own correction.

Showing the left and right tokens (i.e. tokens with index±1) enables the user to decide whether a token is part of a longer word that should be hyphenated.

Endpoint Documentation

Errors are specified according to RFC 7807 Problem Details for HTTP APIs.
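
As a rough sketch of how a frontend might surface such errors, the snippet below parses a problem-details body. The member names (type, title, status, detail, instance) come from RFC 7807. The sample payload and the ProblemDetails helper are hypothetical, and which members the server actually sets is not specified here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProblemDetails:
    """Hypothetical container for the standard RFC 7807 members."""
    type: str = "about:blank"  # RFC 7807 default when "type" is absent
    title: Optional[str] = None
    status: Optional[int] = None
    detail: Optional[str] = None
    instance: Optional[str] = None


def parse_problem(body: str) -> ProblemDetails:
    """Parse a problem-details JSON body, ignoring extension members."""
    data = json.loads(body)
    names = ("type", "title", "status", "detail", "instance")
    return ProblemDetails(**{k: data[k] for k in names if k in data})


# Made-up payload for illustration only; not captured from a real server.
sample = '{"title": "Not Found", "status": 404, "detail": "Token not found."}'
problem = parse_problem(sample)
print(f"{problem.status} {problem.title}: {problem.detail}")
```

Keeping only the standard members means unknown extension members do not break the frontend.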

Resource	Operation	Description
1 Main	GET /	Get list of documents
2 Documents	GET /(string:doc_id)/tokens.json	Get list of tokens in document
3 Tokens	GET /random	Get random token
	GET /(string:doc_id)/token-(int:doc_index).json	Get token
	POST /(string:doc_id)/token-(int:doc_index).json	Update token
	GET /(string:doc_id)/token-(int:doc_index).png	Get token image

GET /

Get an overview of the documents available for correction.

The list will not include documents that the backend considers ‘done’, but they can still be accessed via the other endpoints.

Example response:

HTTP/1.1 200 OK
Content-Type: application/json

[
  {
    "docid": "<docid>",
    "url": "/<docid>/tokens.json",
    "info_url": "...",
    "count": 100,
    "corrected": 87,
    "corrected_by_model": 80,
    "discarded": 10,
    "last_modified": 1605255523
  }
]

Response JSON Array of Objects

docid (string) – ID for the document.
url (string) – URL to list of Tokens in doc.
info_url (string) – URL that provides more info about the document. See workspace.docInfoBaseURL
count (int) – Total number of Tokens.
corrected (int) – Number of corrected Tokens.
corrected_by_model (int) – Number of Tokens that were automatically corrected by the model.
discarded (int) – Number of discarded Tokens.
last_modified (int) – The date/time of the most recently modified Token in the document (an integer timestamp, as in the example above).
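
The following sketch shows one way a frontend might turn an entry of this list into a progress summary. It rests on two assumptions that the documentation does not confirm: that corrected and discarded Tokens do not overlap, and that last_modified is a Unix timestamp in seconds. The summarize helper and the "example-doc" ID are hypothetical.

```python
import json
from datetime import datetime, timezone

# Body modelled on the example response; "example-doc" is a placeholder ID.
body = """
[
  {
    "docid": "example-doc",
    "url": "/example-doc/tokens.json",
    "info_url": "...",
    "count": 100,
    "corrected": 87,
    "corrected_by_model": 80,
    "discarded": 10,
    "last_modified": 1605255523
  }
]
"""


def summarize(doc: dict) -> dict:
    """Derive display values for one entry of the document list."""
    # Assumption: corrected and discarded Tokens do not overlap.
    remaining = doc["count"] - doc["corrected"] - doc["discarded"]
    # Assumption: last_modified is seconds since the Unix epoch.
    modified = datetime.fromtimestamp(doc["last_modified"], tz=timezone.utc)
    return {"docid": doc["docid"], "remaining": remaining, "modified": modified}


summaries = [summarize(doc) for doc in json.loads(body)]
for s in summaries:
    print(s["docid"], s["remaining"], s["modified"].isoformat())
```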

GET /(string: doc_id)/token-(int: doc_index).json

Get information about a specific Token.

Returns 404 if the document or token cannot be found, otherwise 200.

Note: If the token is the second part of a hyphenated token, and the server is configured for it, a 302-redirect to the previous token will be returned.

Note: The data is not escaped; care must be taken when displaying in a browser.

Example response:

HTTP/1.1 200 OK
Content-Type: application/json

{
  "Bin": 2,
  "Heuristic": "annotator",
  "Doc ID": "7696",
  "Gold": "",
  "Hyphenated": false,
  "Discarded": false,
  "Index": 2676,
  "Original": "Jornben.",
  "Selection": [],
  "Token info": "...",
  "Token type": "PDFToken",
  "Page": 1,
  "Frame": [0, 0, 100, 100],
  "Annotation info": "...",
  "image_url": "/7696/token-2676.png",
  "k-best": {
        "1": { "candidate": "Jornben", "probability": 2.96675056066388e-08 },
        "2": { "candidate": "Joreben", "probability": 7.41372275428713e-10 },
        "3": { "candidate": "Jornhen", "probability": 6.17986300962785e-10 },
        "4": { "candidate": "Joraben", "probability": 5.52540106969346e-10 }
  },
  "Last Modified": 1605255523
}

Parameters

doc_id (string) – The ID of the requested document.
doc_index (int) – The placement of the requested Token in the document.

Return

A JSON dictionary of information about the requested Token. Keys relevant for frontend display include Original (the uncorrected OCR result) and Gold (the corrected version, if available). For further information, see the Token class.
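
As a minimal sketch of how the suggestion list for the combo box might be built from such a response, the snippet below orders the k-best candidates by rank and escapes them before display, since the data is not escaped by the server. JSON object keys arrive as strings after parsing, so the ranks are converted with int(). The suggestions helper is hypothetical, and the trimmed token dictionary only mirrors the keys used here.

```python
import html
from typing import List

# Trimmed copy of the example response, as it would look after json.loads().
token = {
    "Original": "Jornben.",
    "Gold": "",
    "k-best": {
        "1": {"candidate": "Jornben", "probability": 2.96675056066388e-08},
        "2": {"candidate": "Joreben", "probability": 7.41372275428713e-10},
        "3": {"candidate": "Jornhen", "probability": 6.17986300962785e-10},
        "4": {"candidate": "Joraben", "probability": 5.52540106969346e-10},
    },
}


def suggestions(token: dict) -> List[str]:
    """Return HTML-escaped candidates ordered by their k-best rank."""
    ranked = sorted(token["k-best"].items(), key=lambda item: int(item[0]))
    return [html.escape(entry["candidate"]) for _, entry in ranked]


for candidate in suggestions(token):
    print(candidate)
```

Escaping at the point of display keeps the stored strings untouched, so a correction can be posted back exactly as the user typed it.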

POST /(string: doc_id)/token-(int: doc_index).json

Update a given token with a gold transcription and/or hyphenation info.

Returns 404 if the document or token cannot be found, otherwise 200.

If an invalid hyphenate value is submitted, status code 400 will be returned.

Note: If gold and hyphenate are supplied, the gold value will be inspected. If it contains a hyphen, the left and right parts will be set on the respective tokens. If it does not, the gold will be set on the leftmost token, and the right one discarded.

Note: If the hyphenation is set to left, a redirect to the new “head” token will be returned.

Parameters

doc_id (string) – The ID of the requested document.
doc_index (int) – The placement of the requested Token in the document.

Request JSON Object

gold (string) – Set new correction for this Token.
info (string annotation) – Save some metadata about this correction (e.g. username, date). Will only be saved if there is a gold correction.
hyphenate (string) – Optionally hyphenate to the left or right.

Return

A JSON dictionary of information about the updated Token. NB: If the hyphenation is set to left, a redirect to the new “head” token will be returned.
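
The sketch below prepares such an update request with the standard library, without sending it. The base URL and the build_update_request helper are assumptions for illustration. The metadata field is left out, and the hyphenate value is checked on the client side so that an invalid value is caught before the server answers with status code 400.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:5000"  # assumed address of the server


def build_update_request(
    doc_id: str,
    doc_index: int,
    gold: Optional[str] = None,
    hyphenate: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request that updates one Token."""
    if hyphenate is not None and hyphenate not in ("left", "right"):
        raise ValueError(f"invalid hyphenate value: {hyphenate!r}")
    payload = {}
    if gold is not None:
        payload["gold"] = gold
    if hyphenate is not None:
        payload["hyphenate"] = hyphenate
    return urllib.request.Request(
        f"{BASE_URL}/{doc_id}/token-{doc_index}.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request("7696", 2676, gold="Jornben")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen() would send it. Note that urlopen() does not automatically follow a redirect returned for a POST, so the redirect to the new “head” token would need to be handled by the caller.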

GET /(string: doc_id)/token-(int: doc_index).png

Returns a snippet of the original document as an image, for comparing with the OCR result.

Returns 404 if the document or token cannot be found, otherwise 200.

Parameters

doc_id (string) – The ID of the requested document.
doc_index (int) – The placement of the requested Token in the document.

Query Parameters

leftmargin (int) – Optional left margin. See PDFToken.extract_image() for defaults. TODO
rightmargin (int) – Optional right margin.
topmargin (int) – Optional top margin.
bottommargin (int) – Optional bottom margin.

Return

A PNG image of the requested Token.
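
A frontend could assemble the image URL along these lines. The token_image_url helper is hypothetical; only the four margin names listed above are accepted, and any margin left out is simply omitted so that the server applies its own default.

```python
from urllib.parse import urlencode

ALLOWED_MARGINS = ("leftmargin", "rightmargin", "topmargin", "bottommargin")


def token_image_url(doc_id: str, doc_index: int, **margins: int) -> str:
    """Build the relative URL of a Token image with optional margins."""
    unknown = set(margins) - set(ALLOWED_MARGINS)
    if unknown:
        raise ValueError(f"unknown margin parameters: {sorted(unknown)}")
    # Keep a fixed parameter order so identical requests yield identical URLs.
    query = urlencode({k: int(margins[k]) for k in ALLOWED_MARGINS if k in margins})
    path = f"/{doc_id}/token-{doc_index}.png"
    return f"{path}?{query}" if query else path


print(token_image_url("7696", 2676))
print(token_image_url("7696", 2676, leftmargin=10, topmargin=5))
```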

GET /(string: doc_id)/tokens.json

Get information about the Tokens in a given document.

Returns 404 if the document cannot be found, otherwise 200.

Parameters

doc_id (string) – The ID of the requested document.

Example response:

HTTP/1.1 200 OK
Content-Type: application/json

[
  {
    "info_url": "/<docid>/token-0.json",
    "image_url": "/<docid>/token-0.png",
    "string": "Example",
    "is_corrected": true,
    "is_discarded": false,
    "requires_annotator": false,
    "last_modified": 1605255523
  },
  {
    "info_url": "/<docid>/token-1.json",
    "image_url": "/<docid>/token-1.png",
    "string": "Exanpie",
    "is_corrected": false,
    "is_discarded": false,
    "requires_annotator": true,
    "has_error": false,
    "last_modified": null
  }
]

Response JSON Array of Objects

info_url (string) – URL to Token info.
image_url (string) – URL to Token image.
string (string) – Current Token string.
is_corrected (bool) – Whether the Token is currently marked as corrected.
is_discarded (bool) – Whether the Token is currently marked as discarded.
last_modified (int) – The date/time when the Token was last modified. May be null, as for the second token in the example above.
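
As an illustration, a frontend might pick out the Tokens that still need a human annotator as sketched below. The needs_annotation helper and the "example-doc" ID are hypothetical, and treating a missing requires_annotator key as false is an assumption rather than documented behaviour.

```python
import json

# Body modelled on the example response; "example-doc" is a placeholder ID.
body = """
[
  {
    "info_url": "/example-doc/token-0.json",
    "image_url": "/example-doc/token-0.png",
    "string": "Example",
    "is_corrected": true,
    "is_discarded": false,
    "requires_annotator": false,
    "last_modified": 1605255523
  },
  {
    "info_url": "/example-doc/token-1.json",
    "image_url": "/example-doc/token-1.png",
    "string": "Exanpie",
    "is_corrected": false,
    "is_discarded": false,
    "requires_annotator": true,
    "has_error": false,
    "last_modified": null
  }
]
"""


def needs_annotation(token: dict) -> bool:
    """True if the Token is flagged for an annotator and still open."""
    return (
        token.get("requires_annotator", False)  # assumed default: False
        and not token["is_corrected"]
        and not token["is_discarded"]
    )


tokens = json.loads(body)
pending = [t["info_url"] for t in tokens if needs_annotation(t)]
print(pending)
```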

GET /random

Returns a 302-redirect to a random token from a random document. TODO: filter by needing annotator

Example response:

HTTP/1.1 302 Found
Location: /<docid>/token-<index>.json
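
A client that handles the redirect itself could resolve the Location header as in the sketch below. The base URL, the parse_token_location helper, and the assumption that the server is mounted at the root path are illustrative only.

```python
import re
from typing import Tuple
from urllib.parse import urljoin, urlsplit

# Matches paths of the form /<docid>/token-<index>.json
TOKEN_PATH = re.compile(r"^/(?P<doc_id>[^/]+)/token-(?P<doc_index>\d+)\.json$")


def parse_token_location(base_url: str, location: str) -> Tuple[str, str, int]:
    """Resolve a Location header and extract the document ID and index."""
    absolute = urljoin(base_url, location)
    match = TOKEN_PATH.match(urlsplit(absolute).path)
    if match is None:
        raise ValueError(f"not a token URL: {location!r}")
    return absolute, match["doc_id"], int(match["doc_index"])


# Assumed server address; the Location value follows the pattern shown above.
url, doc_id, doc_index = parse_token_location(
    "http://localhost:5000/random", "/7696/token-2676.json"
)
print(url, doc_id, doc_index)
```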