CorrectOCR.tokens module¶
-
class
CorrectOCR.tokens.
Token
(original, docid, index)[source]¶ Bases:
abc.ABC
Abstract base class. Tokens handle single words. …
- Parameters
-
static
register
(cls)[source]¶ Decorator which registers a
Token
subclass with the base class.- Parameters
cls – Token subclass
-
docid
¶ The doc with which the Token is associated.
-
index
¶ The placement of the Token in the doc.
-
bin
: Optional[Bin]¶ Heuristics bin.
-
kbest
: DefaultDict[int, KBestItem]¶ Dictionary of k-best suggestions for the Token. They are keyed with a numerical index starting at 1, and the values are instances of
KBestItem
.
-
is_hyphenated
¶ Whether the token is hyphenated to the following token.
-
is_discarded
¶ Whether the token has been discarded (marked irrelevant by code or annotator).
-
class
CorrectOCR.tokens.
Tokenizer
(language, dehyphenate)[source]¶ Bases:
abc.ABC
Abstract base class. The Tokenizer subclasses handle extracting
Token
instances from a document.- Parameters
language (
pycountry.Language
) – The language to use for tokenization (for example, the .txt tokenizer internally uses nltk whose tokenizers function best with a language parameter).
-
static
register
(extensions)[source]¶ Decorator which registers a
Tokenizer
subclass with the base class.
-
static
for_extension
(ext)[source]¶ Obtain the suitable subclass for the given extension. Currently, Tokenizers are provided for the following extensions:
.txt
– plain old text..pdf
– assumes the PDF contains images and OCRed text..tiff
– will run OCR on the image and generate a PDF..png
– will run OCR on the image and generate a PDF.
-
class
CorrectOCR.tokens.
TokenList
(config, docid=None, tokens=None)[source]¶ Bases:
collections.abc.MutableSequence
-
static
register
(storagetype)[source]¶ Decorator which registers a
TokenList
subclass with the base class.- Parameters
storagetype (
str
) – fs or db
-
property
corrected_count
¶
-
property
discarded_count
¶
-
static