CorrectOCR.server module

Below are some examples for a possible frontend. Naturally, they are only suggestions and any workflow and interface may be used.

Example Workflow

@startuml

!include https://raw.githubusercontent.com/bschwarz/puml-themes/master/themes/cerulean/puml-theme-cerulean.puml

|Frontend|
start

:Get available documents
""GET /"";

|Backend|
:Look up and return
available documents from database;

|Frontend|
while (Documents available?) is (yes)
	:Select document and request list of tokens
	""GET /<docid>/tokens.json"";

	|Backend|
	:Look up document and return
	list of tokens from database;

	|Frontend|
	while (Tokens available?) is (yes)
		:Request token info from server
		""GET /<docid>/token-<index>.json""
		""GET /<docid>/token-<index>.png""
	
		or
	
		""GET /random""
		(redirects to a random token's JSON);
	
		|Backend|
		:Look up and return
		token from database;
	
		|Frontend|
		:Present user with a
		token to evaluate;

		:User chooses;
	
		if (accept)
			:Submit choice to server:
			""POST /<docid>/token-<index>.json""
			with //original// as ""gold"" parameter;
		elseif (correct)
			:Submit choice to server:
			""POST /<docid>/token-<index>.json""
			with //user input// as ""gold"" parameter;
		elseif (hyphenate)
			:Submit choice to server:
			""POST /<docid>/token-<index>.json""
			with //left// or //right// as ""hyphenate"" parameter;
		else (nothing)
			stop
		endif

		|Backend|
		:Write ""gold"" token to database;
	
	endwhile (no)
	'TODO fix arrow
	-[#blue]->
endwhile (no)

|Frontend|
stop

@enduml

Open the image in a new window to view at size.

Example User Interface

@startsalt
{
	{+
		left token | TOKEN IMAGE | right token
		.
		. | ^Suggestions^ | .
		.
		[Hyphenate left] | [Accept] | [Hyphenate right]
	}
}
@endsalt

The combo box would then contain the k-best suggestions from the backend, allowing the user to accept the desired one or enter their own correction.

Showing the left and right tokens (i.e. tokens with index±1) enables the user to decide if a token is part of a longer word that should be hyphenated.
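
As a minimal sketch of how a frontend might locate the neighbouring tokens, the helper below builds the JSON paths for index-1 and index+1 from the URL pattern used in this document. The function names are hypothetical, and treating index 0 as the first token is an assumption based on the tokens.json example.

```python
from typing import Optional


def token_url(doc_id: str, index: int, ext: str = "json") -> str:
    """Build the path of a token resource, e.g. /7696/token-2676.json."""
    return f"/{doc_id}/token-{index}.{ext}"


def neighbour_urls(doc_id: str, index: int) -> dict:
    """Return the JSON paths of the tokens at index-1 and index+1.

    Assumption: index 0 is the first token, so it has no left neighbour.
    """
    left: Optional[str] = token_url(doc_id, index - 1) if index > 0 else None
    return {"left": left, "right": token_url(doc_id, index + 1)}
```

The right neighbour may not exist for the last token of a document; in that case the server is documented to answer with 404, which the frontend would need to handle.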

Endpoint Documentation

Errors are specified according to RFC 7807 Problem Details for HTTP APIs.
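
A frontend could decode such error bodies along the following lines. This is only a sketch: the member names (type, title, status, detail) and the "about:blank" default come from RFC 7807, while the Problem class, the function name, and the sample body are made up for illustration and do not show what this server actually sends.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Problem:
    """The standard RFC 7807 members that are most useful for display."""
    type: str
    title: Optional[str]
    status: Optional[int]
    detail: Optional[str]


def parse_problem(body: str) -> Problem:
    """Parse a problem-details JSON document into a Problem."""
    data = json.loads(body)
    return Problem(
        type=data.get("type", "about:blank"),  # RFC 7807 default when absent
        title=data.get("title"),
        status=data.get("status"),
        detail=data.get("detail"),
    )


# Hypothetical error body, for illustration only.
sample = '{"title": "Not Found", "status": 404, "detail": "No such token."}'
problem = parse_problem(sample)
```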

Resource      Operation                                          Description
1 Main        GET /                                              Get list of documents
2 Documents   GET /(string:doc_id)/tokens.json                   Get list of tokens in document
3 Tokens      GET /random                                        Get random token
              GET /(string:doc_id)/token-(int:doc_index).json    Get token
              POST /(string:doc_id)/token-(int:doc_index).json   Update token
              GET /(string:doc_id)/token-(int:doc_index).png     Get token image

GET /

Get an overview of the documents available for correction.

The list will not include documents that the backend considers ‘done’, but they can still be accessed via the other endpoints.

Example response:

HTTP/1.1 200 OK
Content-Type: application/json

[
  {
    "docid": "<docid>",
    "url": "/<docid>/tokens.json",
    "info_url": "...",
    "count": 100,
    "corrected": 87,
    "corrected_by_model": 80,
    "discarded": 10,
    "last_modified": 1605255523
  }
]
Response JSON Array of Objects
  • docid (string) – ID for the document.

  • url (string) – URL to list of Tokens in doc.

  • info_url (string) – URL that provides more info about the document. See workspace.docInfoBaseURL

  • count (int) – Total number of Tokens.

  • corrected (int) – Number of corrected Tokens.

  • corrected_by_model (int) – Number of Tokens that were automatically corrected by the model.

  • discarded (int) – Number of discarded Tokens.

  • last_modified (int) – The date/time of the last modified token.
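
The fields above might be turned into a progress overview roughly as follows. This is a sketch under stated assumptions: the summarise function is hypothetical, and reading last_modified as a Unix timestamp in seconds is an inference from the example value, not something the documentation states.

```python
from datetime import datetime, timezone


def summarise(doc: dict) -> dict:
    """Reduce one entry of the GET / response to display-ready values."""
    count = doc["count"]
    # Guard against empty documents to avoid division by zero.
    fraction = doc["corrected"] / count if count else 0.0
    modified = doc.get("last_modified")
    return {
        "docid": doc["docid"],
        "fraction_corrected": fraction,
        # Assumption: last_modified is seconds since the Unix epoch.
        "last_modified": (
            datetime.fromtimestamp(modified, tz=timezone.utc)
            if modified is not None
            else None
        ),
    }


# Entry taken from the example response above.
example = {
    "docid": "<docid>",
    "url": "/<docid>/tokens.json",
    "info_url": "...",
    "count": 100,
    "corrected": 87,
    "corrected_by_model": 80,
    "discarded": 10,
    "last_modified": 1605255523,
}
summary = summarise(example)
```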

GET /(string: doc_id)/token-(int: doc_index).json

Get information about a specific Token.

Returns 404 if the document or token cannot be found, otherwise 200.

Note: If the token is the second part of a hyphenated token, and the server is configured for it, a 302-redirect to the previous token will be returned.

Note: The data is not escaped; care must be taken when displaying in a browser.

Example response:

HTTP/1.1 200 OK
Content-Type: application/json

{
  "Bin": 2,
  "Heuristic": "annotator",
  "Doc ID": "7696",
  "Gold": "",
  "Hyphenated": false,
  "Discarded": false,
  "Index": 2676,
  "Original": "Jornben.",
  "Selection": [],
  "Token info": "...",
  "Token type": "PDFToken",
  "Page": 1,
  "Frame": [0, 0, 100, 100],
  "Annotation info": "...",
  "image_url": "/7696/token-2676.png",
  "k-best": {
        "1": { "candidate": "Jornben", "probability": 2.96675056066388e-08 },
        "2": { "candidate": "Joreben", "probability": 7.41372275428713e-10 },
        "3": { "candidate": "Jornhen", "probability": 6.17986300962785e-10 },
        "4": { "candidate": "Joraben", "probability": 5.52540106969346e-10 }
  },
  "Last Modified": 1605255523
}
Parameters
  • doc_id (string) – The ID of the requested document.

  • doc_index (int) – The placement of the requested Token in the document.

Return

A JSON dictionary of information about the requested Token. Relevant keys for frontend display include Original (the uncorrected OCR result) and Gold (the corrected version, if available). For further information, see the Token class.
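
To fill the suggestions combo box, a frontend could rank the k-best candidates and escape any token text before inserting it into HTML, since the data is returned unescaped. The sketch below uses only the keys shown in the example response; the function names are hypothetical.

```python
import html


def suggestions(token: dict) -> list:
    """Return the k-best candidates, most probable first."""
    ranked = sorted(
        token.get("k-best", {}).values(),
        key=lambda entry: entry["probability"],
        reverse=True,
    )
    return [entry["candidate"] for entry in ranked]


def safe_for_html(text: str) -> str:
    """Escape token text before placing it in an HTML page."""
    return html.escape(text)


# Subset of the example response above.
token = {
    "Original": "Jornben.",
    "Gold": "",
    "k-best": {
        "1": {"candidate": "Jornben", "probability": 2.96675056066388e-08},
        "2": {"candidate": "Joreben", "probability": 7.41372275428713e-10},
        "3": {"candidate": "Jornhen", "probability": 6.17986300962785e-10},
        "4": {"candidate": "Joraben", "probability": 5.52540106969346e-10},
    },
}
options = suggestions(token)
```

Sorting by probability rather than by key avoids depending on how the ranks are encoded in the k-best dictionary.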

POST /(string: doc_id)/token-(int: doc_index).json

Update a given token with a gold transcription and/or hyphenation info.

Returns 404 if the document or token cannot be found, otherwise 200.

If an invalid hyphenate value is submitted, status code 400 will be returned.

Note: If gold and hyphenate are supplied, the gold value will be inspected. If it contains a hyphen, the left and right parts will be set on the respective tokens. If it does not, the gold will be set on the leftmost token, and the right one discarded.

Note: If the hyphenation is set to left, a redirect to the new “head” token will be returned.

Parameters
  • doc_id (string) – The ID of the requested document.

  • doc_index (int) – The placement of the requested Token in the document.

Request JSON Object
  • gold (string) – Set new correction for this Token.

  • info (string annotation) – Save some metadata about this correction (e.g. username, date). Will only be saved if there is a gold correction.

  • hyphenate (string) – Optionally hyphenate to the left or right.

Return

A JSON dictionary of information about the updated Token. NB: If the hyphenation is set to left, a redirect to the new “head” token will be returned.
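
The three user choices from the example workflow (accept, correct, hyphenate) could be mapped to update requests as sketched below. The helper names and the base URL are hypothetical, and sending the body as application/json is an assumption based on the "Request JSON Object" heading. The request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import Optional

VALID_HYPHENATE = {"left", "right"}


def build_update(gold: Optional[str] = None,
                 hyphenate: Optional[str] = None,
                 info: Optional[str] = None) -> dict:
    """Assemble the JSON object for a token update."""
    if hyphenate is not None and hyphenate not in VALID_HYPHENATE:
        # The server is documented to answer 400 here; fail early instead.
        raise ValueError(f"hyphenate must be one of {sorted(VALID_HYPHENATE)}")
    payload = {}
    if gold is not None:
        payload["gold"] = gold
        if info is not None:
            # info is only saved alongside a gold correction.
            payload["info"] = info
    if hyphenate is not None:
        payload["hyphenate"] = hyphenate
    return payload


def build_request(base_url: str, doc_id: str, index: int,
                  payload: dict) -> urllib.request.Request:
    """Create (but do not send) the POST request for a token update."""
    url = f"{base_url.rstrip('/')}/{doc_id}/token-{index}.json"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed encoding
        method="POST",
    )


# "Accept": submit the original string as gold (hypothetical base URL).
request = build_request("http://localhost:5000", "7696", 2676,
                        build_update(gold="Jornben.", info="annotator-1"))
```

Passing the request to urllib.request.urlopen would submit it; a client would also need to follow the redirect that is returned when hyphenating to the left.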

GET /(string: doc_id)/token-(int: doc_index).png

Returns a snippet of the original document as an image, for comparing with the OCR result.

Returns 404 if the document or token cannot be found, otherwise 200.

Parameters
  • doc_id (string) – The ID of the requested document.

  • doc_index (int) – The placement of the requested Token in the document.

Query Parameters
  • leftmargin (int) – Optional left margin. See PDFToken.extract_image() for defaults. TODO

  • rightmargin (int) – Optional right margin.

  • topmargin (int) – Optional top margin.

  • bottommargin (int) – Optional bottom margin.

Return

A PNG image of the requested Token.
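
The optional margins are ordinary query parameters, so an image URL could be assembled as in this sketch. The image_url helper is hypothetical; only the four parameter names come from the list above.

```python
from urllib.parse import urlencode

ALLOWED_MARGINS = {"leftmargin", "rightmargin", "topmargin", "bottommargin"}


def image_url(doc_id: str, index: int, **margins: int) -> str:
    """Build the path of a token image, with optional margin parameters."""
    unknown = set(margins) - ALLOWED_MARGINS
    if unknown:
        raise ValueError(f"unknown margin parameters: {sorted(unknown)}")
    path = f"/{doc_id}/token-{index}.png"
    # Sort the parameters so the resulting URL is deterministic.
    query = urlencode(sorted(margins.items()))
    return f"{path}?{query}" if query else path
```

For example, image_url("7696", 2676, leftmargin=10, rightmargin=10) yields a path that requests extra context on both sides of the token.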

GET /(string: doc_id)/tokens.json

Get information about the Tokens in a given document.

Returns 404 if the document cannot be found, otherwise 200.

Parameters
  • doc_id (string) – The ID of the requested document.

Example response:

HTTP/1.1 200 OK
Content-Type: application/json

[
  {
    "info_url": "/<docid>/token-0.json",
    "image_url": "/<docid>/token-0.png",
    "string": "Example",
    "is_corrected": true,
    "is_discarded": false,
    "requires_annotator": false,
    "last_modified": 1605255523
  },
  {
    "info_url": "/<docid>/token-1.json",
    "image_url": "/<docid>/token-1.png",
    "string": "Exanpie",
    "is_corrected": false,
    "is_discarded": false,
    "requires_annotator": true,
    "has_error": false,
    "last_modified": null
  }
]
Response JSON Array of Objects
  • info_url (string) – URL to Token info.

  • image_url (string) – URL to Token image.

  • string (string) – Current Token string.

  • is_corrected (bool) – Whether the Token has been corrected at the moment.

  • is_discarded (bool) – Whether the Token has been discarded at the moment.

  • last_modified (int) – The date/time when the token was last modified. May be null, as in the second entry of the example.
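
A frontend that only wants to present tokens still needing human attention could filter the list as sketched here. Note that requires_annotator appears in the example response but is not described in the field list, so its exact meaning is an assumption; the function name is hypothetical.

```python
def pending_tokens(tokens: list) -> list:
    """Return the entries that still appear to need an annotator."""
    return [
        token for token in tokens
        if token.get("requires_annotator")  # assumed meaning, see above
        and not token.get("is_corrected")
        and not token.get("is_discarded")
    ]


# Entries taken from the example response above (placeholders kept as-is).
tokens = [
    {"info_url": "/<docid>/token-0.json", "string": "Example",
     "is_corrected": True, "is_discarded": False,
     "requires_annotator": False, "last_modified": 1605255523},
    {"info_url": "/<docid>/token-1.json", "string": "Exanpie",
     "is_corrected": False, "is_discarded": False,
     "requires_annotator": True, "last_modified": None},
]
pending = pending_tokens(tokens)
```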

GET /random

Returns a 302-redirect to a random token from a random document. TODO: filter by needing annotator

Example response:

HTTP/1.1 302 Found
Location: /<docid>/token-<index>.json
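
If the client handles the redirect itself, it might extract the document ID and token index from the Location header as follows. This is a sketch: the function name is hypothetical, and the pattern assumes the URL layout shown in this document.

```python
import re
from urllib.parse import urlparse

# Matches paths of the form /<docid>/token-<index>.json
TOKEN_PATH = re.compile(r"^/(?P<doc_id>[^/]+)/token-(?P<index>\d+)\.json$")


def parse_token_location(location: str) -> tuple:
    """Split a Location header value into (doc_id, index)."""
    # urlparse copes with both relative paths and absolute URLs.
    path = urlparse(location).path
    match = TOKEN_PATH.match(path)
    if match is None:
        raise ValueError(f"not a token URL: {location!r}")
    return match.group("doc_id"), int(match.group("index"))
```

With the two values in hand, the frontend can request the matching .png image and the neighbouring tokens without parsing the URL again.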