CorrectOCR.tokens module¶
-
class
CorrectOCR.tokens.Token(original, docid, index)[source]¶ Bases:
abc.ABCAbstract base class. Tokens handle single words. …
- Parameters
-
static
register(cls)[source]¶ Decorator which registers a
Tokensubclass with the base class.- Parameters
cls – Token subclass
-
docid¶ The doc with which the Token is associated.
-
index¶ The placement of the Token in the doc.
-
bin: Optional[Bin]¶ Heuristics bin.
-
kbest: DefaultDict[int, KBestItem]¶ Dictionary of k-best suggestions for the Token. They are keyed with a numerical index starting at 1, and the values are instances of
KBestItem.
-
is_hyphenated¶ Whether the token is hyphenated to the following token.
-
is_discarded¶ Whether the token has been discarded (marked irrelevant by code or annotator).
-
class
CorrectOCR.tokens.Tokenizer(language, dehyphenate)[source]¶ Bases:
abc.ABCAbstract base class. The Tokenizer subclasses handle extracting
Tokeninstances from a document.- Parameters
language (
pycountry.Language) – The language to use for tokenization (for example, the .txt tokenizer internally uses nltk whose tokenizers function best with a language parameter).
-
static
register(extensions)[source]¶ Decorator which registers a
Tokenizersubclass with the base class.
-
static
for_extension(ext)[source]¶ Obtain the suitable subclass for the given extension. Currently, Tokenizers are provided for the following extensions:
.txt– plain old text..pdf– assumes the PDF contains images and OCRed text..tiff– will run OCR on the image and generate a PDF..png– will run OCR on the image and generate a PDF.
-
class
CorrectOCR.tokens.TokenList(config, docid=None, tokens=None)[source]¶ Bases:
collections.abc.MutableSequence-
static
register(storagetype)[source]¶ Decorator which registers a
TokenListsubclass with the base class.- Parameters
storagetype (
str) – fs or db
-
property
corrected_count¶
-
property
discarded_count¶
-
static