CorrectOCR.tokens module

class CorrectOCR.tokens.Token(original, docid, index)[source]

Bases: abc.ABC

Abstract base class. Tokens handle single words. …

Parameters
  • original (str) – Original spelling of the token.

  • docid (str) – The doc with which the Token is associated.

  • index (int) – The placement of the Token in the doc.

static register(cls)[source]

Decorator which registers a Token subclass with the base class.

Parameters

cls – Token subclass
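
For example, a minimal sketch of registering a custom subclass (the subclass name and its token_info implementation are hypothetical):

    from CorrectOCR.tokens import Token

    @Token.register
    class MyToken(Token):
        """Hypothetical Token subclass, for illustration only."""

        @property
        def token_info(self):
            # Subclass-specific payload; here simply the original
            # spelling (illustrative, not the behavior of any
            # built-in subclass).
            return self.original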

docid

The doc with which the Token is associated.

index

The placement of the Token in the doc.

bin: Optional[Bin]

Heuristics bin.

kbest: DefaultDict[int, KBestItem]

Dictionary of k-best suggestions for the Token. They are keyed with a numerical index starting at 1, and the values are instances of KBestItem.
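
For example, a token's suggestions could be inspected like this (assuming token is an existing Token instance):

    # Keys start at 1; each value is a KBestItem.
    for k, item in sorted(token.kbest.items()):
        print(f'{k}: {item.candidate} ({item.probability:.3f})')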

decision: Optional[str]

The decision that was made when gold was set automatically.

selection: Any

The selected automatic correction for the decision.

is_hyphenated

Whether the token is hyphenated to the following token.

is_discarded

Whether the token has been discarded (marked irrelevant by code or annotator).

abstract property token_info

Return type

Any

property original

The original spelling of the Token.

Return type

str

property gold

The corrected spelling of the Token.

Return type

str

property k

The number of k-best suggestions for the Token.

Return type

int

is_punctuation()[source]

Is the Token purely punctuation?

Return type

bool

is_numeric()[source]

Is the Token purely numeric?

Return type

bool

classmethod from_dict(d)[source]

Initialize and return a new Token with values from a dictionary.

Parameters

d (dict) – A dictionary of properties for the Token

Return type

Token
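
A sketch of constructing a Token from a dictionary, reusing the hypothetical MyToken subclass from above (the dictionary keys shown are assumptions; the actual layout is storage-specific):

    token = MyToken.from_dict({
        'original': 'Exampel',
        'docid': 'doc001',
        'index': 42,
    })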

class CorrectOCR.tokens.Tokenizer(language, dehyphenate)[source]

Bases: abc.ABC

Abstract base class. Tokenizer subclasses handle extracting Token instances from a document.

Parameters
  • language (pycountry.Language) – The language to use for tokenization (for example, the .txt tokenizer internally uses nltk, whose tokenizers work best with a language parameter).

  • dehyphenate (bool) – Whether tokens hyphenated across line breaks should be merged during tokenization.

static register(extensions)[source]

Decorator which registers a Tokenizer subclass with the base class.

Parameters

extensions (List[str]) – List of extensions that the subclass will handle

static for_extension(ext)[source]

Obtain the suitable subclass for the given extension. Currently, Tokenizers are provided for the following extensions:

  • .txt – plain old text.

  • .pdf – assumes the PDF contains images and OCRed text.

  • .tiff – will run OCR on the image and generate a PDF.

  • .png – will run OCR on the image and generate a PDF.

Parameters

ext (str) – Filename extension (including leading period).

Return type

ABCMeta

Returns

A Tokenizer subclass.
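
For instance, a sketch of obtaining and instantiating a tokenizer for plain-text files (the pycountry lookup is one way to obtain a Language object):

    import pycountry

    from CorrectOCR.tokens import Tokenizer

    TxtTokenizer = Tokenizer.for_extension('.txt')
    tokenizer = TxtTokenizer(
        language=pycountry.languages.get(name='English'),
        dehyphenate=True,
    )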

abstract tokenize(file, storageconfig)[source]

Generate tokens for the given document.

Parameters
  • file (Path) – The document file to tokenize.

  • storageconfig – Storage configuration (database or filesystem) for the resulting Tokens.

Return type

TokenList

Returns

The Tokens generated from the document.
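
A sketch of end-to-end use, assuming the tokenizer instance from the for_extension() example above and a storageconfig object whose construction is not shown here:

    from pathlib import Path

    tokens = tokenizer.tokenize(Path('document.txt'), storageconfig)
    print(len(tokens), 'tokens extracted')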

abstract static apply(original, tokens, corrected)[source]

abstract static crop_tokens(original, config, tokens, edge_left=None, edge_right=None)[source]

class CorrectOCR.tokens.TokenList(config, docid=None, tokens=None)[source]

Bases: collections.abc.MutableSequence

Abstract base class. TokenList subclasses handle storing and loading a document's Tokens via the configured storage backend (filesystem or database).

static register(storagetype)[source]

Decorator which registers a TokenList subclass with the base class.

Parameters

storagetype (str) – Either fs (filesystem) or db (database).

static new(config, docid=None, tokens=None)[source]

Create a TokenList of the subclass matching the configured storage type.

Return type

TokenList

static for_type(type)[source]

Obtain the TokenList subclass registered for the given storage type.

Return type

ABCMeta

insert(key, value)[source]

S.insert(index, value) – insert value before index

static exists(config, docid)[source]

Check whether stored tokens exist for the given docid.

Return type

bool

abstract load(docid)[source]

Load the Tokens for the given docid from storage.

abstract save(token=None)[source]

Save the Tokens to storage.

Parameters

token – An optional single Token to save.

property corrected_count

The number of tokens that have a gold correction.

property discarded_count

The number of tokens that have been discarded.

random_token_index(has_gold=False, is_discarded=False)[source]

Return the index of a random token, optionally restricted to tokens that have a gold correction and/or are discarded.

random_token(has_gold=False, is_discarded=False)[source]

Return a random token, optionally restricted to tokens that have a gold correction and/or are discarded.
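
A sketch of common operations, assuming a config object for the chosen storage backend (its construction is not shown, and the exact creation/loading flow may differ by backend):

    from CorrectOCR.tokens import TokenList

    if TokenList.exists(config, 'doc001'):
        tokens = TokenList.new(config, docid='doc001')
        tokens.load('doc001')
        print(tokens.corrected_count, 'corrected of', len(tokens))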
class CorrectOCR.tokens.KBestItem(candidate, probability)[source]

Bases: tuple

A single k-best suggestion, pairing a candidate spelling with its estimated probability.

property candidate

Alias for field number 0: the suggested spelling.

property probability

Alias for field number 1: the estimated probability of the candidate.
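
For example:

    from CorrectOCR.tokens import KBestItem

    item = KBestItem(candidate='example', probability=0.87)
    assert item.candidate == 'example'
    assert item.probability == 0.87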

CorrectOCR.tokens.tokenize_str(data, language='english')[source]

Tokenize a string of text into a list of token strings.

Parameters
  • data (str) – The text to tokenize.

  • language – Name of the language whose tokenizer should be used.

Return type

List[str]
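
A minimal usage sketch:

    from CorrectOCR.tokens import tokenize_str

    words = tokenize_str('The quick brown fox.', language='english')
    # Expected shape: a list of word and punctuation strings, e.g.
    # ['The', 'quick', 'brown', 'fox', '.'] (exact splitting depends
    # on the underlying tokenizer).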