CorrectOCR.server module
Below are some examples for a possible frontend. Naturally, they are only suggestions and any workflow and interface may be used.
Example Workflow
Open the image in a new window to view at size.
Example User Interface
The Combo box would then contain the k-best suggestions from the backend, allowing the user to accept the desired one or enter their own correction.
Showing the left and right tokens (i.e. tokens with index±1) enables the user to decide whether a token is part of a longer word that should be hyphenated.
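To display a token together with its neighbours, a frontend only needs the URLs of the tokens at index−1 and index+1. The following is a minimal Python sketch, assuming the token URL pattern from the endpoint documentation; the helper name `context_urls` is hypothetical and not part of CorrectOCR.

```python
def context_urls(doc_id: str, index: int) -> dict:
    """Return info URLs for a token and its left/right neighbours.

    Hypothetical helper: assumes token indexes start at 0, so the first
    token has no left neighbour. The upper bound is not checked here; the
    server is documented to answer 404 for tokens it cannot find.
    """
    def url(i: int) -> str:
        return f"/{doc_id}/token-{i}.json"

    return {
        "left": url(index - 1) if index > 0 else None,
        "current": url(index),
        "right": url(index + 1),
    }
```

A frontend could request all three URLs and treat a 404 on the right neighbour as the end of the document.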
Endpoint Documentation
Errors are specified according to RFC 7807 Problem Details for HTTP APIs.
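Resource | Operation | Description |
---|---|---|
1 Main | Get list of documents | GET / |
2 Documents | Get list of tokens in document | GET /(string: doc_id)/tokens.json |
3 Tokens | Get random token | GET /random |
 | Get token | GET /(string: doc_id)/token-(int: doc_index).json |
 | Update token | POST /(string: doc_id)/token-(int: doc_index).json |
 | Get token image | GET /(string: doc_id)/token-(int: doc_index).png |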
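Because errors follow RFC 7807, a client can handle them uniformly by reading the standard members `type`, `title`, `status` and `detail`. The sketch below shows one way to turn such a body into a short message. The sample payload is invented for illustration, and the exact wording returned by the server may differ.

```python
import json


def describe_problem(body: str) -> str:
    """Summarise an RFC 7807 problem document for display or logging."""
    problem = json.loads(body)
    # Every member is optional in RFC 7807, so fall back gracefully.
    title = problem.get("title", "Unknown error")
    status = problem.get("status")
    detail = problem.get("detail")
    summary = f"{status} {title}" if status is not None else title
    return f"{summary}: {detail}" if detail else summary


# Hypothetical payload, not copied from an actual server response.
example = json.dumps({
    "type": "about:blank",
    "title": "Not Found",
    "status": 404,
    "detail": "Document not found.",
})
print(describe_problem(example))
```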
- GET /
Get an overview of the documents available for correction.
The list will not include documents that the backend considers ‘done’, but they can still be accessed via the other endpoints.
Example response:
HTTP/1.1 200 OK
Content-Type: application/json

[
  {
    "docid": "<docid>",
    "url": "/<docid>/tokens.json",
    "info_url": "...",
    "count": 100,
    "corrected": 87,
    "corrected_by_model": 80,
    "discarded": 10,
    "last_modified": 1605255523
  }
]
- Response JSON Array of Objects
docid (string) – ID for the document.
url (string) – URL to list of Tokens in doc.
info_url (string) – URL that provides more info about the document. See workspace.docInfoBaseURL
count (int) – Total number of Tokens.
corrected (int) – Number of corrected Tokens.
corrected_by_model (int) – Number of Tokens that were automatically corrected by the model.
discarded (int) – Number of discarded Tokens.
last_modified (int) – The date/time at which a token in the document was last modified.
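A frontend might use these fields to show progress per document. The sketch below parses the example response and derives a completion fraction. It assumes that `last_modified` is a Unix timestamp in seconds, which the example value suggests but the documentation does not state. The helper name `summarise` is hypothetical.

```python
import json
from datetime import datetime, timezone

# Sample payload copied from the example response for GET /.
payload = """[{"docid": "<docid>", "url": "/<docid>/tokens.json",
  "info_url": "...", "count": 100, "corrected": 87,
  "corrected_by_model": 80, "discarded": 10,
  "last_modified": 1605255523}]"""


def summarise(doc: dict) -> dict:
    """Derive display values from one entry of the document list."""
    count = doc["count"]
    stamp = doc.get("last_modified")
    return {
        "docid": doc["docid"],
        # Guard against empty documents to avoid dividing by zero.
        "fraction_corrected": doc["corrected"] / count if count else 0.0,
        # Assumption: last_modified is a Unix timestamp in seconds.
        "last_modified": (
            datetime.fromtimestamp(stamp, tz=timezone.utc)
            if stamp is not None else None
        ),
    }


summaries = [summarise(doc) for doc in json.loads(payload)]
```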
- GET /(string: doc_id)/token-(int: doc_index).json
Get information about a specific Token.
Returns 404 if the document or token cannot be found, otherwise 200.
Note: If the token is the second part of a hyphenated token, and the server is configured for it, a 302 redirect to the previous token will be returned.
Note: The data is not escaped; care must be taken when displaying it in a browser.
Example response:
HTTP/1.1 200 OK
Content-Type: application/json

{
  "Bin": 2,
  "Heuristic": "annotator",
  "Doc ID": "7696",
  "Gold": "",
  "Hyphenated": false,
  "Discarded": false,
  "Index": 2676,
  "Original": "Jornben.",
  "Selection": [],
  "Token info": "...",
  "Token type": "PDFToken",
  "Page": 1,
  "Frame": [0, 0, 100, 100],
  "Annotation info": "...",
  "image_url": "/7696/token-2676.png",
  "k-best": {
    "1": { "candidate": "Jornben", "probability": 2.96675056066388e-08 },
    "2": { "candidate": "Joreben", "probability": 7.41372275428713e-10 },
    "3": { "candidate": "Jornhen", "probability": 6.17986300962785e-10 },
    "4": { "candidate": "Joraben", "probability": 5.52540106969346e-10 }
  },
  "Last Modified": 1605255523
}
- Parameters
doc_id (string) – The ID of the requested document.
doc_index (int) – The placement of the requested Token in the document.
- Return
A JSON dictionary of information about the requested Token. Relevant keys for frontend display include original (uncorrected OCR result) and gold (corrected version, if available); they appear as "Original" and "Gold" in the example response above. For further information, see the Token class.
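The k-best candidates in this response are what the combo box in the example interface would offer. The sketch below orders them by probability and escapes them for HTML, since the API returns unescaped data. It assumes the k-best keys arrive as JSON strings and sorts on `probability` so that it does not depend on them. The helper name `combo_options` is hypothetical.

```python
import html

# Trimmed-down copy of the example token response.
token = {
    "Original": "Jornben.",
    "Gold": "",
    "k-best": {
        "1": {"candidate": "Jornben", "probability": 2.96675056066388e-08},
        "2": {"candidate": "Joreben", "probability": 7.41372275428713e-10},
        "3": {"candidate": "Jornhen", "probability": 6.17986300962785e-10},
        "4": {"candidate": "Joraben", "probability": 5.52540106969346e-10},
    },
}


def combo_options(token: dict) -> list:
    """Return HTML-escaped suggestions, most probable first."""
    ranked = sorted(
        token["k-best"].values(),
        key=lambda item: item["probability"],
        reverse=True,
    )
    options = [item["candidate"] for item in ranked]
    # Offer the raw OCR result too, in case it was already right.
    if token["Original"] not in options:
        options.append(token["Original"])
    # The API returns unescaped data, so escape before building markup.
    return [html.escape(option) for option in options]


options = combo_options(token)
```

Escaping at the last step keeps the comparison with the original string exact while still protecting the page from markup embedded in OCR output.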
- POST /(string: doc_id)/token-(int: doc_index).json
Update a given token with a gold transcription and/or hyphenation info.
Returns 404 if the document or token cannot be found, otherwise 200.
If an invalid hyphenate value is submitted, status code 400 will be returned.
Note: If gold and hyphenate are supplied, the gold value will be inspected. If it contains a hyphen, the left and right parts will be set on the respective tokens. If it does not, the gold will be set on the leftmost token, and the right one discarded.
Note: If the hyphenation is set to left, a redirect to the new “head” token will be returned.
- Parameters
doc_id (string) – The ID of the requested document.
doc_index (int) – The placement of the requested Token in the document.
- Request JSON Object
gold (string) – Set new correction for this Token.
info (string annotation) – Save some metadata about this correction (e.g. username, date). Will only be saved if there is a gold correction.
hyphenate (string) – Optionally hyphenate to the left or right.
- Return
A JSON dictionary of information about the updated Token. NB: If the hyphenation is set to left, a redirect to the new “head” token will be returned.
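A client-side sketch of preparing this request with the Python standard library follows. Several details are assumptions rather than documented facts: the base URL, the helper name `build_update_request`, the literal hyphenate values "left" and "right", and the use of the key names exactly as listed under Request JSON Object. The request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: adjust to your deployment


def build_update_request(doc_id, doc_index, gold=None, info=None, hyphenate=None):
    """Build (but do not send) a POST request that updates a token."""
    # Assumption: "left" and "right" are the accepted hyphenate values.
    # The server is documented to answer 400 for anything invalid.
    if hyphenate is not None and hyphenate not in ("left", "right"):
        raise ValueError(f"invalid hyphenate value: {hyphenate!r}")
    body = {}
    if gold is not None:
        body["gold"] = gold
        # info is documented as saved only alongside a gold correction.
        if info is not None:
            body["info"] = info
    if hyphenate is not None:
        body["hyphenate"] = hyphenate
    return urllib.request.Request(
        f"{BASE_URL}/{doc_id}/token-{doc_index}.json",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request("7696", 2676, gold="Jornben", info="editor")
```

Passing the result to `urllib.request.urlopen` would send it. Validating hyphenate locally avoids a round trip that would end in a 400 response.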
- GET /(string: doc_id)/token-(int: doc_index).png
Returns a snippet of the original document as an image, for comparing with the OCR result.
Returns 404 if the document or token cannot be found, otherwise 200.
- Parameters
doc_id (string) – The ID of the requested document.
doc_index (int) – The placement of the requested Token in the document.
- Query Parameters
leftmargin (int) – Optional left margin. See PDFToken.extract_image() for defaults. TODO
rightmargin (int) – Optional right margin.
topmargin (int) – Optional top margin.
bottommargin (int) – Optional bottom margin.
- Return
A PNG image of the requested Token.
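The margin query parameters can be appended with the standard library. This is a small sketch that builds the image URL and rejects parameter names not listed above; the helper name `token_image_url` is hypothetical.

```python
from urllib.parse import urlencode

MARGIN_NAMES = ("leftmargin", "rightmargin", "topmargin", "bottommargin")


def token_image_url(doc_id, doc_index, **margins):
    """Build the relative URL of a token image with optional margins."""
    unknown = set(margins) - set(MARGIN_NAMES)
    if unknown:
        raise ValueError(f"unknown margin parameter(s): {sorted(unknown)}")
    path = f"/{doc_id}/token-{doc_index}.png"
    # Emit parameters in a fixed order; omitted margins are left to the
    # server, which applies its own defaults.
    query = urlencode(
        [(name, int(margins[name])) for name in MARGIN_NAMES if name in margins]
    )
    return f"{path}?{query}" if query else path
```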
- GET /(string: doc_id)/tokens.json
Get information about the Tokens in a given document.
Returns 404 if the document cannot be found, otherwise 200.
- Parameters
doc_id (string) – The ID of the requested document.
Example response:
HTTP/1.1 200 OK
Content-Type: application/json

[
  {
    "info_url": "/<docid>/token-0.json",
    "image_url": "/<docid>/token-0.png",
    "string": "Example",
    "is_corrected": true,
    "is_discarded": false,
    "requires_annotator": false,
    "last_modified": 1605255523
  },
  {
    "info_url": "/<docid>/token-1.json",
    "image_url": "/<docid>/token-1.png",
    "string": "Exanpie",
    "is_corrected": false,
    "is_discarded": false,
    "requires_annotator": true,
    "has_error": false,
    "last_modified": null
  }
]
- Response JSON Array of Objects
info_url (string) – URL to Token info.
image_url (string) – URL to Token image.
string (string) – Current Token string.
is_corrected (bool) – Whether the Token is currently marked as corrected.
is_discarded (bool) – Whether the Token is currently marked as discarded.
last_modified (int) – The date/time when the token was last modified. May be null, as in the second entry of the example response.
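A frontend can use this list to build a work queue. The sketch below selects the tokens that still need a human, using the `requires_annotator` key that appears in the example response but is not described in the field list, so its exact semantics are an assumption. The helper name `pending_tokens` is hypothetical.

```python
import json

# Sample payload reduced from the example response for tokens.json.
payload = """[
  {"info_url": "/<docid>/token-0.json", "string": "Example",
   "is_corrected": true, "is_discarded": false,
   "requires_annotator": false, "last_modified": 1605255523},
  {"info_url": "/<docid>/token-1.json", "string": "Exanpie",
   "is_corrected": false, "is_discarded": false,
   "requires_annotator": true, "last_modified": null}
]"""


def pending_tokens(tokens: list) -> list:
    """Return info URLs of tokens that still await manual correction."""
    return [
        token["info_url"]
        for token in tokens
        # Assumption: a missing requires_annotator key means "not needed".
        if token.get("requires_annotator", False)
        and not token["is_corrected"]
        and not token["is_discarded"]
    ]


queue = pending_tokens(json.loads(payload))
```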
- GET /random
Returns a 302-redirect to a random token from a random document. TODO: filter by needing annotator
Example response:
HTTP/1.1 302 Found
Location: /<docid>/token-<index>.json
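A client that does not follow redirects automatically can read the document ID and token index from the Location header. The sketch below assumes the header follows the token URL pattern shown in the example and accepts both relative and absolute URLs; the helper name `parse_token_location` is hypothetical.

```python
import re
from urllib.parse import urlsplit

TOKEN_PATH = re.compile(r"^/(?P<doc_id>[^/]+)/token-(?P<index>\d+)\.json$")


def parse_token_location(location: str):
    """Extract (doc_id, index) from a redirect Location header value."""
    # urlsplit tolerates both "/doc/token-1.json" and full absolute URLs.
    path = urlsplit(location).path
    match = TOKEN_PATH.match(path)
    if match is None:
        raise ValueError(f"not a token URL: {location!r}")
    return match["doc_id"], int(match["index"])
```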