CorrectOCR.tokens module¶
- class CorrectOCR.tokens.Token(original, docid, index)[source]¶
Bases:
abc.ABCAbstract base class. Tokens handle single words. …
- Parameters
- static register(cls)[source]¶
Decorator which registers a
Tokensubclass with the base class.- Parameters
cls – Token subclass
- docid¶
The doc with which the Token is associated.
- index¶
The placement of the Token in the doc.
- gold¶
- bin: Optional[CorrectOCR.heuristics.Bin]¶
Heuristics bin.
- kbest: DefaultDict[int, CorrectOCR.model.kbest.KBestItem]¶
Dictionary of k-best suggestions for the Token. They are keyed with a numerical index starting at 1, and the values are instances of
KBestItem.
- is_hyphenated¶
- is_discarded¶
(documented in @property methods below)
- annotations¶
A list of arbitrary key/value info about the annotations
- has_error¶
Whether the token has an unhandled error
- last_modified¶
When one of the
gold,ìs_hyphenated,is_discarded, orhas_errorproperties were last updated.
- cached_image_path¶
Where the image file should be cached. Is not guaranteed to exist, but can be generated via extract_image()
- abstract property page: int¶
The page of the document on which the token is located.
May not be applicable for all token types.
- Return type
- Returns
The page number.
- abstract property frame: int, int, int, int¶
The coordinates of the token’s location on the page.
Takes the form [x0, y0, x1, y1] where (x0, y0) is the top-left corner, and (x1, y1) is the bottom-right corner.
May not be applicable for all token types.
- Returns
The frame coordinates.
- class CorrectOCR.tokens.Tokenizer(language)[source]¶
Bases:
abc.ABCAbstract base class. The Tokenizer subclasses handle extracting
Tokeninstances from a document.- Parameters
language (
pycountry.Language) – The language to use for tokenization (for example, the .txt tokenizer internally uses nltk whose tokenizers function best with a language parameter).
- static register(extensions)[source]¶
Decorator which registers a
Tokenizersubclass with the base class.
- static for_extension(ext)[source]¶
Obtain the suitable subclass for the given extension. Currently, Tokenizers are provided for the following extensions:
.txt– plain old text..pdf– assumes the PDF contains images and OCRed text..tiff– will run OCR on the image and generate a PDF..png– will run OCR on the image and generate a PDF.
- class CorrectOCR.tokens.TokenList(config, docid=None, tokens=None)[source]¶
Bases:
collections.abc.MutableSequence- static register(storagetype)[source]¶
Decorator which registers a
TokenListsubclass with the base class.- Parameters
storagetype (
str) – fs or db
- property stats¶
- property consolidated: Tuple[str, str, Token]¶
A consolidated iterator of tokens, where discarded tokens are skipped, and hyphenated original/gold are included.
:returns original, gold, token
- property server_ready¶
- property overview¶
Generator that returns an fast overview of the TokenList.
Each item is a dictionary containing the following keys:
doc_id: The documentdoc_index: The Token’s placement in the documentstring: TODOis_corrected: Whether the Token has a set gold propertyis_discarded: Whether the Token is marked as discarded
- property last_modified¶