CorrectOCR.heuristics module
Heuristics
A given token and its k-best candidates are compared with each other and checked against the dictionary. Based on the outcome, the token is matched with a bin.
| bin | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| k = orig? | T | T | T | F | F | F | F | F | F |
| orig in dict? | T | F | F | F | F | F | T | T | T |
| top k-best in dict? | T | F | F | T | F | F | T | F | F |
| lower-ranked k-best in dict? | – | F | T | – | F | T | – | F | T |
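The table can be read as a lookup from four yes/no answers to a bin number. The following is a minimal sketch of that lookup, not the module's own implementation; the names `BIN_TABLE` and `bin_number` are hypothetical, and a "–" cell is treated as "not consulted".

```python
from typing import Dict, Optional, Tuple

# One row per bin, in table order:
# (k == orig, orig in dict, top k-best in dict, lower-ranked k-best in dict).
# None stands for a "–" cell, i.e. the answer is not consulted for that bin.
# BIN_TABLE and bin_number are illustrative names, not part of CorrectOCR.
Row = Tuple[bool, bool, bool, Optional[bool]]

BIN_TABLE: Dict[int, Row] = {
    1: (True, True, True, None),
    2: (True, False, False, False),
    3: (True, False, False, True),
    4: (False, False, True, None),
    5: (False, False, False, False),
    6: (False, False, False, True),
    7: (False, True, True, None),
    8: (False, True, False, False),
    9: (False, True, False, True),
}


def bin_number(k_is_orig: bool, orig_in_dict: bool,
               top_in_dict: bool, lower_in_dict: bool) -> Optional[int]:
    """Return the bin whose row agrees with the four answers, or None."""
    observed = (k_is_orig, orig_in_dict, top_in_dict, lower_in_dict)
    for number, row in BIN_TABLE.items():
        if all(want is None or want == got for want, got in zip(row, observed)):
            return number
    # Only reachable for combinations the table does not list, e.g. the top
    # candidate equals the original but only one of them is in the dictionary.
    return None
```

For example, a token whose top candidate differs from the original, where the original is in the dictionary, the top candidate is not, and some lower-ranked candidate is, falls into bin 9.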
Each bin must be assigned a setting that determines what decision is made:
- o / original: select the original token as correct.
- k / kbest: select the top k-best candidate as correct.
- d / kdict: select the first lower-ranked candidate that is in the dictionary.
- a / annotator: defer selection to annotator.
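As a rough illustration of how the four settings translate into a decision, the sketch below applies a setting to one token. It rests on assumptions: `apply_setting` is a hypothetical helper, the dictionary is modelled as a plain set of strings, candidates are given best-first, and returning `None` is used here to mean "leave it to the annotator".

```python
from typing import List, Optional, Set


def apply_setting(setting: str, original: str, kbest: List[str],
                  dictionary: Set[str]) -> Optional[str]:
    """Return the selected string, or None when the annotator must decide.

    Hypothetical helper for illustration; kbest is ordered best-first.
    """
    if setting == 'o':
        # original: keep the token as it was read
        return original
    if setting == 'k':
        # kbest: take the top-ranked candidate
        return kbest[0]
    if setting == 'd':
        # kdict: first lower-ranked candidate (rank 2 and below) in the
        # dictionary. Falling back to None if there is none is an assumption.
        return next((c for c in kbest[1:] if c in dictionary), None)
    if setting == 'a':
        # annotator: no automatic selection
        return None
    raise ValueError(f"unknown setting: {setting!r}")
```

With `kbest = ['tek', 'the', 'ten']` and a dictionary containing `'the'` and `'ten'`, the `'d'` setting would pick `'the'`, while `'k'` would pick `'tek'`.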
Once the report and settings are generated, it is not strictly necessary to update them every time the model is updated. It is, however, a good idea to do so regularly as the corpus grows and more tokens become available for the statistics.
- class CorrectOCR.heuristics.Bin(description, matcher, heuristic='a', number=None, counts=<factory>, example=None)
  Bases: object
Heuristics bin …
TODO TABLE
- matcher: Callable[[str, str, Dictionary, str], bool]
  Function or lambda which returns True if a given CorrectOCR.tokens.Token fits into the bin, or False otherwise.
  Parameters:
  - o – Original string
  - k – k-best candidate string
  - d – Dictionary
  - dcode – One of ‘zerokd’, ‘somekd’, ‘allkd’ for whether zero, some, or all other k-best candidates are in dictionary
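A matcher with this four-argument signature might look like the sketch below, written for the row of the table where the top candidate equals the original, neither is in the dictionary, and a lower-ranked candidate is. This is an assumption-laden illustration: `bin3_matcher` is a hypothetical name, the dictionary is stood in for by a plain set of strings, and treating both ‘somekd’ and ‘allkd’ as "a lower-ranked candidate is in the dictionary" is a guess about how `dcode` maps onto the table.

```python
from typing import Set


def bin3_matcher(o: str, k: str, d: Set[str], dcode: str) -> bool:
    """Hypothetical matcher for bin 3 of the table.

    o: original string; k: top k-best candidate string;
    d: dictionary (a plain set here, an assumption);
    dcode: 'zerokd', 'somekd' or 'allkd'.
    """
    return (
        k == o                              # k = orig?                     T
        and o not in d                      # orig in dict?                 F
        and dcode in ('somekd', 'allkd')    # lower-ranked k-best in dict?  T
    )
```

Because the top candidate equals the original in this row, checking `o not in d` also covers the "top k-best in dict?" column.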
- heuristic: str = 'a'
  Which heuristic the bin is set up for, one of:
  - ‘a’ = Defer to annotator.
  - ‘o’ = Select original.
  - ‘k’ = Select top k-best.
  - ‘d’ = Select k-best in dictionary.
- example: Token = None
  An example of a matching CorrectOCR.tokens.Token, used for reporting.
-