CorrectOCR.tokens module
- class CorrectOCR.tokens.Token(original, docid, index)
Bases: abc.ABC
Abstract base class. Tokens handle single words. …
- Parameters
- static register(cls)
Decorator which registers a Token subclass with the base class.
- Parameters
cls – Token subclass
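The registration pattern described above can be sketched as follows. This is a minimal illustration, not CorrectOCR's implementation: the names `BaseToken`, `_subclasses`, and `PlainToken` are hypothetical, and the library may store registered subclasses differently.

```python
from abc import ABC


class BaseToken(ABC):
    """Hypothetical stand-in for an abstract base class with a registry."""

    # Assumption: the registry is a dict keyed by class name.
    _subclasses = {}

    @staticmethod
    def register(cls):
        """Class decorator: record `cls` in the registry and return it unchanged."""
        BaseToken._subclasses[cls.__name__] = cls
        return cls


@BaseToken.register
class PlainToken(BaseToken):
    """Hypothetical subclass, registered by applying the decorator."""
```

Because the decorator returns the class unchanged, the subclass can still be used directly after registration.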
- docid
The doc with which the Token is associated.
- index
The placement of the Token in the doc.
- gold
- bin: Optional[CorrectOCR.heuristics.Bin]
Heuristics bin.
- kbest: DefaultDict[int, CorrectOCR.model.kbest.KBestItem]
Dictionary of k-best suggestions for the Token. They are keyed with a numerical index starting at 1, and the values are instances of KBestItem.
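A short sketch of how a 1-indexed k-best dictionary of this shape might be filled and read. The `KBestItem` class below is a hypothetical stand-in with assumed fields (`candidate`, `probability`); the real `CorrectOCR.model.kbest.KBestItem` may define different ones.

```python
from collections import defaultdict
from typing import DefaultDict, NamedTuple


class KBestItem(NamedTuple):
    """Hypothetical stand-in; the real KBestItem fields may differ."""
    candidate: str = ''
    probability: float = 0.0


kbest: DefaultDict[int, KBestItem] = defaultdict(KBestItem)
kbest[1] = KBestItem('word', 0.9)
kbest[2] = KBestItem('ward', 0.1)

# Keys start at 1, so the top-ranked suggestion is kbest[1].
best = kbest[1]

# Walk the suggestions in rank order.
ranked = [kbest[k].candidate for k in sorted(kbest)]
```

Note that reading a missing key from a `defaultdict` inserts a default entry, so membership should be checked with `in` when that side effect is unwanted.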
- is_hyphenated
- is_discarded
(documented in @property methods below)
- annotations
A list of arbitrary key/value info about the annotations.
- has_error
Whether the token has an unhandled error.
- last_modified
When one of the gold, is_hyphenated, is_discarded, or has_error properties was last updated.
- cached_image_path
Where the image file should be cached. The file is not guaranteed to exist, but can be generated via extract_image().
- abstract property page: int
The page of the document on which the token is located.
May not be applicable for all token types.
- Return type
int
- Returns
The page number.
- abstract property frame: int, int, int, int
The coordinates of the token’s location on the page.
Takes the form [x0, y0, x1, y1] where (x0, y0) is the top-left corner, and (x1, y1) is the bottom-right corner.
May not be applicable for all token types.
- Returns
The frame coordinates.
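As an illustration of the [x0, y0, x1, y1] convention, the sketch below derives a frame's width and height. The helper `frame_size` is hypothetical and not part of CorrectOCR; it assumes that x grows to the right and y grows downward, which follows from (x0, y0) being the top-left corner.

```python
from typing import Tuple

Frame = Tuple[int, int, int, int]


def frame_size(frame: Frame) -> Tuple[int, int]:
    """Return (width, height) of a frame given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = frame
    return x1 - x0, y1 - y0


width, height = frame_size((10, 20, 110, 45))
```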
- class CorrectOCR.tokens.Tokenizer(language)
Bases: abc.ABC
Abstract base class. The Tokenizer subclasses handle extracting Token instances from a document.
- Parameters
language (pycountry.Language) – The language to use for tokenization (for example, the .txt tokenizer internally uses nltk, whose tokenizers function best with a language parameter).
- static register(extensions)
Decorator which registers a Tokenizer subclass with the base class.
- static for_extension(ext)
Obtain the suitable subclass for the given extension. Currently, Tokenizers are provided for the following extensions:
.txt – plain old text.
.pdf – assumes the PDF contains images and OCRed text.
.tiff – will run OCR on the image and generate a PDF.
.png – will run OCR on the image and generate a PDF.
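The pairing of register(extensions) and for_extension(ext) can be sketched as an extension-keyed registry. This is an illustration under assumptions, not the library's code: `BaseTokenizer`, `_by_extension`, `TextTokenizer`, and `ImageTokenizer` are hypothetical names, and returning `None` for an unknown extension is an assumed behavior.

```python
from abc import ABC


class BaseTokenizer(ABC):
    """Hypothetical stand-in for a tokenizer base class with a registry."""

    # Assumption: maps a file extension (with leading dot) to a subclass.
    _by_extension = {}

    @staticmethod
    def register(extensions):
        """Decorator factory: register the decorated class for each extension."""
        def wrapper(cls):
            for ext in extensions:
                BaseTokenizer._by_extension[ext] = cls
            return cls
        return wrapper

    @staticmethod
    def for_extension(ext):
        """Look up the registered subclass; None if the extension is unknown."""
        return BaseTokenizer._by_extension.get(ext)


@BaseTokenizer.register(['.txt'])
class TextTokenizer(BaseTokenizer):
    """Hypothetical tokenizer for plain text."""


@BaseTokenizer.register(['.tiff', '.png'])
class ImageTokenizer(BaseTokenizer):
    """Hypothetical tokenizer shared by two image extensions."""
```

Here `BaseTokenizer.for_extension('.png')` yields `ImageTokenizer`, since one class can be registered for several extensions.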
- class CorrectOCR.tokens.TokenList(config, docid=None, tokens=None)
Bases: collections.abc.MutableSequence
- static register(storagetype)
Decorator which registers a TokenList subclass with the base class.
- Parameters
storagetype (str) – fs or db
- property stats
- property consolidated: Tuple[str, str, Token]
A consolidated iterator of tokens, where discarded tokens are skipped, and hyphenated original/gold are included.
- Returns
original, gold, token
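One possible reading of the consolidated iterator is sketched below. How hyphenated tokens are merged is not specified here, so the sketch makes loud assumptions: a hyphenated token's text is joined with that of the next kept token after dropping trailing hyphens, a missing gold value is treated as an empty string, and a trailing hyphenated token with no successor is dropped. `SimpleToken` and the free function `consolidated` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class SimpleToken:
    """Hypothetical stand-in carrying only the fields this sketch needs."""
    original: str
    gold: Optional[str] = None
    is_hyphenated: bool = False
    is_discarded: bool = False


def consolidated(tokens: List[SimpleToken]) -> Iterator[Tuple[str, str, SimpleToken]]:
    """Yield (original, gold, token), skipping discarded tokens."""
    pending_original = ''
    pending_gold = ''
    for token in tokens:
        if token.is_discarded:
            continue
        original = pending_original + token.original
        gold = pending_gold + (token.gold or '')
        if token.is_hyphenated:
            # Assumption: strip trailing hyphens and carry the text forward.
            pending_original = original.rstrip('-')
            pending_gold = gold.rstrip('-')
            continue
        pending_original = pending_gold = ''
        yield original, gold, token


tokens = [
    SimpleToken('exam-', 'exam-', is_hyphenated=True),
    SimpleToken('ple', 'ple'),
    SimpleToken('###', is_discarded=True),
    SimpleToken('text', 'text'),
]
pairs = [(original, gold) for original, gold, _ in consolidated(tokens)]
```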
- property server_ready
- property overview
Generator that returns a fast overview of the TokenList.
Each item is a dictionary containing the following keys:
doc_id: The document
doc_index: The Token’s placement in the document
string: TODO
is_corrected: Whether the Token has a set gold property
is_discarded: Whether the Token is marked as discarded
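A generator producing dictionaries of this shape might look like the sketch below. `SimpleToken` and the free function `overview` are hypothetical. The `string` key is left out because its meaning is marked TODO above, and "has a set gold property" is assumed to mean that gold is not `None`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, Optional


@dataclass
class SimpleToken:
    """Hypothetical stand-in carrying only the fields this sketch needs."""
    docid: str
    index: int
    gold: Optional[str] = None
    is_discarded: bool = False


def overview(tokens: Iterable[SimpleToken]) -> Iterator[Dict[str, Any]]:
    """Lazily yield one summary dictionary per token."""
    for token in tokens:
        yield {
            'doc_id': token.docid,
            'doc_index': token.index,
            # Assumption: a gold value counts as "set" when it is not None.
            'is_corrected': token.gold is not None,
            'is_discarded': token.is_discarded,
        }


rows = list(overview([
    SimpleToken('doc1', 0, gold='a'),
    SimpleToken('doc1', 1, is_discarded=True),
]))
```

Yielding plain dictionaries keeps each item cheap to build and easy to serialize, which suits a quick summary view.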
- property last_modified