CorrectOCR.heuristics module

Heuristics

A given token and its k-best candidates are compared with each other and checked against the dictionary. Based on the outcome, the token is assigned to one of nine bins.

bin                            1  2  3  4  5  6  7  8  9
k = orig?                      T  T  T  F  F  F  F  F  F
orig in dict?                  T  F  F  F  F  F  T  T  T
top k-best in dict?            T  F  F  T  F  F  T  F  F
lower-ranked k-best in dict?   –  F  T  –  F  T  –  F  T
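The binning rules in the table can be sketched as a small decision function. This is an illustration only, not CorrectOCR's implementation: the name bin_number is hypothetical, the dictionary is stood in for by a plain set of strings, and kbest is assumed to be a non-empty list of candidate strings ordered from best to worst. The sketch also assumes that the lower-ranked candidates only matter when the top candidate is not in the dictionary.

```python
from typing import Sequence, Set


def bin_number(original: str, kbest: Sequence[str], dictionary: Set[str]) -> int:
    """Return the bin (1-9) for a token, following the table's four checks.

    Hypothetical helper: `dictionary` is a plain set and `kbest` is a
    non-empty, best-first list of candidate strings (both assumptions).
    """
    top, lower = kbest[0], kbest[1:]
    same = top == original                 # k = orig?
    orig_in = original in dictionary       # orig in dict?
    top_in = top in dictionary             # top k-best in dict?
    lower_in = any(c in dictionary for c in lower)  # lower-ranked k-best in dict?

    if same:
        if orig_in:
            return 1
        return 3 if lower_in else 2
    if not orig_in:
        if top_in:
            return 4
        return 6 if lower_in else 5
    # Original is in the dictionary but differs from the top candidate.
    if top_in:
        return 7
    return 9 if lower_in else 8
```

For example, with the dictionary {"house", "horse"}, the token "hcuse" with candidates ["hcusc", "house"] falls into bin 6: the top candidate differs from the original, neither is in the dictionary, but a lower-ranked candidate is.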

Each bin must be assigned a setting that determines what decision is made:

  • o / original: select the original token as correct.

  • k / kbest: select the top k-best candidate as correct.

  • d / kdict: select the first lower-ranked candidate that is in the dictionary.

  • a / annotator: defer selection to annotator.
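How a setting turns into a decision can be sketched as follows. The function select_candidate and the ALIASES table are hypothetical names for illustration, the dictionary is again a plain set, and returning None to mean "ask the annotator" is an assumption of this sketch, as is falling back to the annotator when kdict finds no lower-ranked candidate in the dictionary.

```python
from typing import Optional, Sequence, Set

# Short and long forms of the four settings listed above.
ALIASES = {"o": "original", "k": "kbest", "d": "kdict", "a": "annotator"}


def select_candidate(
    setting: str,
    original: str,
    kbest: Sequence[str],
    dictionary: Set[str],
) -> Optional[str]:
    """Return the selected string, or None to defer to the annotator."""
    name = ALIASES.get(setting, setting)
    if name == "original":
        return original
    if name == "kbest":
        return kbest[0]
    if name == "kdict":
        # First lower-ranked candidate that is in the dictionary.
        for candidate in kbest[1:]:
            if candidate in dictionary:
                return candidate
        return None  # assumption: nothing suitable, so defer to the annotator
    if name == "annotator":
        return None
    raise ValueError(f"unknown heuristic setting: {setting!r}")
```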

Once the report and settings have been generated, it is not strictly necessary to regenerate them every time the model is updated. It is, however, a good idea to do so regularly as the corpus grows and more tokens become available for the statistics.

class CorrectOCR.heuristics.Heuristics(settings, dictionary)

Bases: object

Parameters
  • settings (Dict[int, str]) – A mapping from bin number to the heuristic setting for that bin.

  • dictionary – A dictionary for determining correctness of Tokens and suggestions.
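A settings mapping of the documented type Dict[int, str] might look like the sketch below. The values are examples only and not recommendations, validate_settings is a hypothetical helper rather than part of CorrectOCR, and whether the constructor accepts both the one-letter and the long form of each setting is an assumption.

```python
from typing import Dict

# Assumption: both the short and the long spelling of a setting are accepted.
VALID_SETTINGS = {"o", "original", "k", "kbest", "d", "kdict", "a", "annotator"}


def validate_settings(settings: Dict[int, str], bins: int = 9) -> None:
    """Raise ValueError unless every bin 1..bins has a recognised setting."""
    for number in range(1, bins + 1):
        if number not in settings:
            raise ValueError(f"bin {number} has no setting")
        if settings[number] not in VALID_SETTINGS:
            raise ValueError(f"bin {number}: unknown setting {settings[number]!r}")


# Example values only; sensible choices depend on the report for the corpus.
example_settings = {
    1: "o", 2: "a", 3: "a",
    4: "k", 5: "a", 6: "d",
    7: "a", 8: "o", 9: "a",
}
validate_settings(example_settings)
```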

classmethod bin(n)

Return type

Bin

bin_for_word(original, kbest)

bin_tokens(tokens, force=False)

Return type

bool

add_to_report(tokens, rebin=False, hmm=None)

report()

Return type

str

class CorrectOCR.heuristics.Bin(description, matcher, heuristic='annotator', number=None, counts=<factory>, example=None)

Bases: object

Heuristics bin …

description: str

Description of bin

matcher: Callable[[str, str, Dictionary, str], bool]

Function or lambda which returns True if a given CorrectOCR.tokens.Token fits into the bin, or False otherwise.

Parameters
  • o – Original string

  • k – k-best candidate string

  • d – Dictionary

  • dcode – One of ‘zerokd’, ‘somekd’, ‘allkd’ for whether zero, some, or all other k-best candidates are in dictionary
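A matcher with this four-argument shape could be written as in the sketch below. Everything here is illustrative: dcode_for is a hypothetical helper showing one way the dcode string might be derived from the remaining candidates, the two matchers are guesses at what rules for bins 1 and 6 could look like, and a plain set stands in for the Dictionary type (assumed to support the in operator). Treating 'allkd' the same as 'somekd' in the second matcher is also an assumption.

```python
from typing import Callable, Sequence, Set

# Same shape as the documented matcher: (o, k, d, dcode) -> bool.
Matcher = Callable[[str, str, Set[str], str], bool]


def dcode_for(others: Sequence[str], dictionary: Set[str]) -> str:
    """Classify the lower-ranked candidates as 'zerokd', 'somekd' or 'allkd'."""
    hits = sum(1 for candidate in others if candidate in dictionary)
    if hits == 0:
        return "zerokd"  # also covers an empty candidate list
    if hits == len(others):
        return "allkd"
    return "somekd"


# Hypothetical matcher: original equals the top candidate and is in the dictionary.
matches_bin_1: Matcher = lambda o, k, d, dcode: o == k and o in d

# Hypothetical matcher: original and top candidate differ, neither is in the
# dictionary, but at least one lower-ranked candidate is.
matches_bin_6: Matcher = lambda o, k, d, dcode: (
    o != k and o not in d and k not in d and dcode != "zerokd"
)
```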

heuristic: str = 'annotator'

Which heuristic the bin is set up for, one of:

  • ‘annotator’ = Defer to annotator.

  • ‘original’ = Select original.

  • ‘kbest’ = Select top k-best.

  • ‘kdict’ = Select the highest-ranked k-best candidate that is in the dictionary.

number: int = None

The number of the bin.

counts: DefaultDict[str, int]

Statistics used for reporting.

example: original, gold, kbest = None

An example of a match, used for reporting.