CorrectOCR.heuristics module

Heuristics

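Each token is compared with its k-best candidates, and both are looked up in the dictionary. Based on the outcome, the token is assigned to one of the nine bins in the table below, where T and F stand for true and false and – marks a condition that does not matter for that bin.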

bin	1	2	3	4	5	6	7	8	9
top k-best = orig?	T	T	T	F	F	F	F	F	F
orig in dict?	T	F	F	F	F	F	T	T	T
top k-best in dict?	T	F	F	T	F	F	T	F	F
lower-ranked k-best in dict?	–	F	T	–	F	T	–	F	T
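
The table can be read as a small decision function. The following is a minimal sketch of that reading, not the library's own implementation: the function name `bin_number` is hypothetical, the candidates are assumed to be a list of strings ordered best first, and the dictionary is assumed to be any container that supports `in`.

```python
def bin_number(original, kbest, dictionary):
    """Return the bin (1-9) for a token, following the table above.

    Assumptions for this sketch: `kbest` is a non-empty list of candidate
    strings, best first, and `dictionary` supports the `in` operator.
    """
    top, lower_ranked = kbest[0], kbest[1:]
    same = top == original
    orig_in_dict = original in dictionary
    top_in_dict = top in dictionary
    lower_in_dict = any(c in dictionary for c in lower_ranked)

    if same:
        if orig_in_dict:
            return 1                      # original and top k-best agree and are known
        return 3 if lower_in_dict else 2  # agree, but unknown
    if not orig_in_dict:
        if top_in_dict:
            return 4                      # only the top k-best is known
        return 6 if lower_in_dict else 5  # neither original nor top k-best is known
    if top_in_dict:
        return 7                          # both known, but they differ
    return 9 if lower_in_dict else 8      # only the original is known


words = {"cat", "cot", "hat"}
print(bin_number("cqt", ["cxt", "cat"], words))
```

Here the original `cqt` and the top candidate `cxt` differ and neither is in the dictionary, while the lower-ranked `cat` is, so the token lands in bin 6.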

Each bin must be assigned a setting that determines what decision is made:

o / original: select the original token as correct.
k / kbest: select the top k-best candidate as correct.
d / kdict: select the first lower-ranked candidate that is in the dictionary.
a / annotator: defer selection to annotator.
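
The four settings above can be sketched as a single dispatch function. This is an illustration only: the name `decide` and its return convention are hypothetical, and falling back to the annotator when `kdict` finds no dictionary candidate is an assumption, not documented behaviour.

```python
def decide(setting, original, kbest, dictionary):
    """Return (selected_string, needs_annotator) for one token.

    Hypothetical helper; `kbest` is a list of candidates, best first.
    """
    if setting in ("o", "original"):
        return original, False
    if setting in ("k", "kbest"):
        return kbest[0], False
    if setting in ("d", "kdict"):
        # First lower-ranked candidate that is in the dictionary.
        for candidate in kbest[1:]:
            if candidate in dictionary:
                return candidate, False
        # Assumption: with no dictionary candidate, defer to the annotator.
        return None, True
    if setting in ("a", "annotator"):
        return None, True
    raise ValueError(f"unknown setting: {setting!r}")


words = {"cat"}
for setting in ("o", "k", "d", "a"):
    print(setting, decide(setting, "cqt", ["cxt", "cat"], words))
```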

Once the report and settings are generated, it is not strictly necessary to regenerate them every time the model is updated. It is, however, a good idea to do so regularly as the corpus grows and more tokens become available for the statistics.

class CorrectOCR.heuristics.Heuristics(settings, dictionary)

Bases: object

Parameters

settings (Dict[int, str]) – A dictionary mapping each bin number to its heuristic setting.
dictionary – A dictionary for determining correctness of Tokens and suggestions.
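
As an illustration of what a `settings` mapping might look like, the sketch below validates a bin-to-setting dictionary and expands short codes to long names. The helper `normalize_settings` is hypothetical and not part of CorrectOCR. Whether the class expects short codes or long names is not stated here, so the expansion is an assumption, as is the fallback to 'annotator' for bins without a setting (chosen to mirror the default of `Bin.heuristic`).

```python
from typing import Dict

# Short and long setting names as listed in the documentation above.
SETTING_NAMES = {"o": "original", "k": "kbest", "d": "kdict", "a": "annotator"}


def normalize_settings(settings: Dict[int, str], bin_count: int = 9) -> Dict[int, str]:
    """Hypothetical helper: validate a bin -> setting mapping, expanding short codes."""
    normalized = {}
    for number in range(1, bin_count + 1):
        # Assumption: a bin without a setting falls back to 'annotator'.
        raw = settings.get(number, "annotator")
        name = SETTING_NAMES.get(raw, raw)
        if name not in SETTING_NAMES.values():
            raise ValueError(f"bin {number}: unknown setting {raw!r}")
        normalized[number] = name
    return normalized


settings = normalize_settings({1: "o", 2: "a", 3: "d", 4: "k"})
print(settings)
```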

classmethod bin(n)

Return type: Bin

bin_for_word(original, kbest)

bin_tokens(tokens, force=False)

Return type: bool

add_to_report(tokens, rebin=False, hmm=None)

report()

Return type: str

class CorrectOCR.heuristics.Bin(description, matcher, heuristic='annotator', number=None, counts=<factory>, example=None)

Bases: object

Heuristics bin …

TODO TABLE

description: str

Description of the bin.

matcher: Callable[[str, str, Dictionary, str], bool]

Function or lambda which returns True if a given CorrectOCR.tokens.Token fits into the bin, or False otherwise.

Parameters

o – Original string
k – k-best candidate string
d – Dictionary
dcode – One of ‘zerokd’, ‘somekd’, ‘allkd’, indicating whether zero, some, or all of the other k-best candidates are in the dictionary
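
A matcher with this four-argument signature could look like the sketch below. Both functions are hypothetical illustrations rather than the bins the library actually defines: `dcode_for` shows one way the `dcode` string might be derived, and `matches_bin_3` assumes that "lower-ranked k-best in dict" in the table corresponds to any `dcode` other than ‘zerokd’. A plain `set` stands in for the Dictionary type.

```python
def dcode_for(other_candidates, dictionary):
    """Hypothetical helper: classify how many other k-best candidates are in the dictionary."""
    hits = sum(1 for candidate in other_candidates if candidate in dictionary)
    if hits == 0:
        return "zerokd"
    if hits == len(other_candidates):
        return "allkd"
    return "somekd"


def matches_bin_3(o, k, d, dcode):
    """Illustrative matcher for bin 3 of the table.

    The top k-best equals the original, neither is in the dictionary,
    and at least one other candidate is (assumed to mean dcode != 'zerokd').
    """
    return o == k and o not in d and dcode != "zerokd"


words = {"cat"}
dcode = dcode_for(["cat", "czt"], words)
print(dcode, matches_bin_3("cqt", "cqt", words, dcode))
```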

heuristic: str = 'annotator'

Which heuristic the bin is set up for, one of:

‘annotator’ = Defer to annotator.
‘original’ = Select original.
‘kbest’ = Select top k-best.
‘kdict’ = Select the highest-ranked k-best candidate that is in the dictionary.

number: int = None

The number of the bin.

counts: DefaultDict[str, int]

Statistics used for reporting.

example: original, gold, kbest = None

An example of a match, used for reporting.