CorrectOCR.tokens module

class CorrectOCR.tokens.Token(original, docid, index)[source]

Bases: abc.ABC

Abstract base class. Tokens handle single words. …

Parameters
  • original (str) – Original spelling of the token.

  • docid (str) – The doc with which the Token is associated.

  • index – The placement of the Token in the doc.

static register(cls)[source]

Decorator which registers a Token subclass with the base class.

Parameters

cls – Token subclass
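The documentation does not show the registry internals; as a rough sketch of how such a registration decorator is typically wired up (the class names below are illustrative stand-ins, not the real CorrectOCR classes):

```python
from abc import ABC

class Token(ABC):
    """Stand-in; the real base class lives in CorrectOCR.tokens."""
    subclasses = []

    @staticmethod
    def register(cls):
        """Decorator which records a Token subclass on the base class."""
        Token.subclasses.append(cls)
        return cls

@Token.register
class HypotheticalToken(Token):
    pass

# HypotheticalToken is now discoverable via Token.subclasses
```

Because register returns the class unchanged, the decorator can be stacked with others without affecting the subclass itself.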

docid

The doc with which the Token is associated.

index

The placement of the Token in the doc.

gold

The corrected spelling of the Token, if known.

bin: Optional[CorrectOCR.heuristics.Bin]

Heuristics bin.

kbest: DefaultDict[int, CorrectOCR.model.kbest.KBestItem]

Dictionary of k-best suggestions for the Token. They are keyed with a numerical index starting at 1, and the values are instances of KBestItem.
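To illustrate the 1-indexed mapping described above: KBestItem's real fields are defined in CorrectOCR.model.kbest and are not shown here, so the NamedTuple below is a hypothetical stand-in.

```python
from typing import NamedTuple

class KBestItem(NamedTuple):  # hypothetical stand-in, not the real class
    candidate: str
    probability: float

kbest = {
    1: KBestItem('word', 0.92),  # keys start at 1, not 0
    2: KBestItem('ward', 0.05),
}

best = kbest[1]  # the top-ranked suggestion
```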

heuristic: Optional[str]

The heuristic that was determined by the bin.

selection: Any

The selected automatic correction for the heuristic.

is_hyphenated
is_discarded

(documented in @property methods below)

annotations

A list of arbitrary key/value pairs with information about the annotations.

has_error

Whether the token has an unhandled error

last_modified

When one of the gold, is_hyphenated, is_discarded, or has_error properties was last updated.

cached_image_path

Where the image file should be cached. It is not guaranteed to exist, but can be generated via extract_image().

abstract property token_info: Any
Return type

Any

abstract property page: int

The page of the document on which the token is located.

May not be applicable for all token types.

Return type

int

Returns

The page number.

abstract property frame: Tuple[int, int, int, int]

The coordinates of the token’s location on the page.

Takes the form [x0, y0, x1, y1] where (x0, y0) is the top-left corner, and (x1, y1) is the bottom-right corner.

May not be applicable for all token types.

Returns

The frame coordinates.
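Given the [x0, y0, x1, y1] convention above, the token's width and height follow directly (coordinates below are illustrative):

```python
# Frame is [x0, y0, x1, y1]: (x0, y0) top-left, (x1, y1) bottom-right.
frame = (102, 340, 245, 368)  # illustrative coordinates
x0, y0, x1, y1 = frame
width = x1 - x0   # 143
height = y1 - y0  # 28
```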

property k: int

The number of k-best suggestions for the Token.

Return type

int

is_punctuation()[source]

Is the Token purely punctuation?

Return type

bool

is_numeric()[source]

Is the Token purely numeric?

Return type

bool
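As a hedged sketch of what these purity checks might look like (the real implementations in CorrectOCR may differ, e.g. in Unicode handling):

```python
import string

def is_punctuation(original: str) -> bool:
    """True if the token consists solely of punctuation characters."""
    return len(original) > 0 and all(c in string.punctuation for c in original)

def is_numeric(original: str) -> bool:
    """True if the token consists solely of numeric characters."""
    return original.isnumeric()

is_punctuation('!?')  # True
is_numeric('1842')    # True
is_numeric('18th')    # False -- mixed digits and letters
```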

classmethod from_dict(d)[source]

Initialize and return a new Token with values from a dictionary.

Parameters

d (dict) – A dictionary of properties for the Token

Return type

Token

drop_cached_image()[source]
extract_image(workspace, highlight_word=True, left=300, right=300, top=15, bottom=15, force=False)[source]
Return type

Tuple[Path, Any]

class CorrectOCR.tokens.Tokenizer(language)[source]

Bases: abc.ABC

Abstract base class. The Tokenizer subclasses handle extracting Token instances from a document.

Parameters

language (pycountry.Language) – The language to use for tokenization (for example, the .txt tokenizer internally uses nltk, whose tokenizers function best with a language parameter).

static register(extensions)[source]

Decorator which registers a Tokenizer subclass with the base class.

Parameters

extensions (List[str]) – List of extensions that the subclass will handle

static for_extension(ext)[source]

Obtain the suitable subclass for the given extension. Currently, Tokenizers are provided for the following extensions:

  • .txt – plain old text.

  • .pdf – assumes the PDF contains images and OCRed text.

  • .tiff – will run OCR on the image and generate a PDF.

  • .png – will run OCR on the image and generate a PDF.

Parameters

ext (str) – Filename extension (including leading period).

Return type

ABCMeta

Returns

A Tokenizer subclass.
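The extension-to-subclass lookup suggests a registry populated by register() and queried by for_extension(). A rough sketch of that pattern, with illustrative stand-in classes (the real CorrectOCR internals are not reproduced here):

```python
from abc import ABC
from typing import Dict, List, Type

class Tokenizer(ABC):
    """Stand-in base class; the real one lives in CorrectOCR.tokens."""
    _registry: Dict[str, Type['Tokenizer']] = {}

    @staticmethod
    def register(extensions: List[str]):
        """Decorator: map each filename extension to the subclass."""
        def decorator(cls):
            for ext in extensions:
                Tokenizer._registry[ext] = cls
            return cls
        return decorator

    @staticmethod
    def for_extension(ext: str) -> Type['Tokenizer']:
        """Look up the subclass registered for the given extension."""
        return Tokenizer._registry[ext]

@Tokenizer.register(['.txt'])
class HypotheticalTextTokenizer(Tokenizer):
    pass

cls = Tokenizer.for_extension('.txt')  # note the leading period
```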

abstract tokenize(file, storageconfig)[source]

Generate tokens for the given document.

Parameters
  • file (Path) – A given document.

  • storageconfig – Storage configuration (database, filesystem) for the resulting Tokens.

Return type

TokenList

abstract static apply(original, tokens, outfile, highlight=False)[source]
abstract static crop_tokens(original, config, tokens, edge_left=None, edge_right=None)[source]
class CorrectOCR.tokens.TokenList(config, docid=None, tokens=None)[source]

Bases: collections.abc.MutableSequence

static register(storagetype)[source]

Decorator which registers a TokenList subclass with the base class.

Parameters

storagetype (str) – fs or db

static new(config, docid=None, tokens=None)[source]
Return type

TokenList

static for_type(type)[source]
Return type

ABCMeta

insert(key, value)[source]

S.insert(index, value) – insert value before index

abstract load()[source]
abstract save(token=None)[source]
preload()[source]
flush()[source]
property stats
classmethod validate_stats(docid, stats)[source]
property consolidated: Tuple[str, str, Token]

A consolidated iterator of tokens, where discarded tokens are skipped, and hyphenated original/gold are included.

Returns

original, gold, token

Return type

Tuple[str, str, Token]
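One plausible reading of the skipping/merging behaviour described above, sketched with SimpleNamespace objects standing in for real Tokens (the exact hyphen-joining rules of the real property may differ):

```python
from types import SimpleNamespace as T

# Hypothetical tokens; SimpleNamespace stands in for real Token objects.
tokens = [
    T(original='exam-', gold='exam-', is_discarded=False, is_hyphenated=True),
    T(original='ple', gold='ple', is_discarded=False, is_hyphenated=False),
    T(original='|', gold='', is_discarded=True, is_hyphenated=False),
]

def consolidated(tokens):
    """Yield (original, gold, token), skipping discarded tokens and
    folding a hyphenated token together with the one that follows it."""
    it = iter(tokens)
    for token in it:
        if token.is_discarded:
            continue
        original, gold = token.original, token.gold
        if token.is_hyphenated:
            second = next(it)  # consume the second half of the word
            original = original[:-1] + second.original  # drop the hyphen
            gold = gold[:-1] + second.gold
        yield original, gold, token

result = list(consolidated(tokens))  # one merged ('example', 'example', …)
```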

property server_ready
random_token_index(has_gold=False, is_discarded=False)[source]
random_token(has_gold=False, is_discarded=False)[source]
property overview

Generator that returns a fast overview of the TokenList.

Each item is a dictionary containing the following keys:

  • doc_id: The document

  • doc_index: The Token’s placement in the document

  • string: TODO

  • is_corrected: Whether the Token has a set gold property

  • is_discarded: Whether the Token is marked as discarded
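Per the key list above, a single overview item would take this shape (values are illustrative):

```python
# Shape of one item yielded by the overview generator.
item = {
    'doc_id': 'doc0001',      # the document
    'doc_index': 42,          # the Token's placement in the document
    'string': 'exampel',      # see TODO in the docs
    'is_corrected': True,     # gold property is set
    'is_discarded': False,    # not marked as discarded
}
```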

property last_modified
dehyphenate()[source]
CorrectOCR.tokens.tokenize_str(data, language='english')[source]
Return type

List[str]
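A hedged sketch of the expected behaviour: the real function delegates to language-aware tokenization (per the Tokenizer docs above, nltk for plain text), so this regex split is only a stand-in that shows the shape of the result.

```python
import re
from typing import List

def tokenize_str(data: str, language: str = 'english') -> List[str]:
    """Stand-in: split words from punctuation. The real CorrectOCR
    function uses a language-aware tokenizer instead."""
    return re.findall(r"\w+|[^\w\s]", data)

tokenize_str('Dr. Smith arrived.')  # ['Dr', '.', 'Smith', 'arrived', '.']
```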