CorrectOCR.tokens module

class CorrectOCR.tokens.Token(original, docid, index)[source]

Bases: abc.ABC

Abstract base class. Tokens handle single words. …

Parameters
  • original (str) – Original spelling of the token.

  • docid (str) – The doc with which the Token is associated.

  • index (int) – The placement of the Token in the doc.

static register(cls)[source]

Decorator which registers a Token subclass with the base class.

Parameters

cls – Token subclass
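
For example, a minimal sketch of registering a custom subclass (the subclass name and its token_info implementation are hypothetical):

    from CorrectOCR.tokens import Token

    @Token.register
    class MyToken(Token):
        """Hypothetical Token subclass, for illustration only."""

        @property
        def token_info(self):
            # Subclass-specific payload; here simply the original
            # spelling (illustrative, not the behavior of any
            # built-in subclass).
            return self.original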

docid

The doc with which the Token is associated.

index

The placement of the Token in the doc.

bin: Optional[Bin]

Heuristics bin.

kbest: DefaultDict[int, KBestItem]

Dictionary of k-best suggestions for the Token. They are keyed with a numerical index starting at 1, and the values are instances of KBestItem.
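
For example, a token's suggestions could be inspected like this (assuming token is an existing Token instance):

    # Keys start at 1; each value is a KBestItem.
    for k, item in sorted(token.kbest.items()):
        print(f'{k}: {item.candidate} ({item.probability:.3f})')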

decision: Optional[str]

The decision that was made when gold was set automatically.

selection: Any

The selected automatic correction for the decision.

is_hyphenated

Whether the token is hyphenated to the following token.

is_discarded

Whether the token has been discarded (marked irrelevant by code or annotator).

abstract property token_info

Return type

Any

property original

The original spelling of the Token.

Return type

str

property gold

The corrected spelling of the Token.

Return type

str

property k

The number of k-best suggestions for the Token.

Return type

int

is_punctuation()[source]

Is the Token purely punctuation?

Return type

bool

is_numeric()[source]

Is the Token purely numeric?

Return type

bool

classmethod from_dict(d)[source]

Initialize and return a new Token with values from a dictionary.

Parameters

d (dict) – A dictionary of properties for the Token

Return type

Token
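
A sketch of constructing a Token from a dictionary, reusing the hypothetical MyToken subclass from above (the dictionary keys shown are assumptions; the actual layout is storage-specific):

    token = MyToken.from_dict({
        'original': 'Exampel',
        'docid': 'doc001',
        'index': 42,
    })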

class CorrectOCR.tokens.Tokenizer(language, dehyphenate)[source]

Bases: abc.ABC

Abstract base class. Tokenizer subclasses handle extracting Token instances from a document.

Parameters
  • language (pycountry.Language) – The language to use for tokenization (for example, the .txt tokenizer internally uses nltk, whose tokenizers work best with a language parameter).

  • dehyphenate (bool) – Whether tokens hyphenated across line breaks should be merged during tokenization.

static register(extensions)[source]

Decorator which registers a Tokenizer subclass with the base class.

Parameters

extensions (List[str]) – List of extensions that the subclass will handle

static for_extension(ext)[source]

Obtain the suitable subclass for the given extension. Currently, Tokenizers are provided for the following extensions:

  • .txt – plain old text.

  • .pdf – assumes the PDF contains images and OCRed text.

  • .tiff – will run OCR on the image and generate a PDF.

  • .png – will run OCR on the image and generate a PDF.

Parameters

ext (str) – Filename extension (including leading period).

Return type

ABCMeta

Returns

A Tokenizer subclass.
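
For instance, a sketch of obtaining and instantiating a tokenizer for plain-text files (the pycountry lookup is one way to obtain a Language object):

    import pycountry

    from CorrectOCR.tokens import Tokenizer

    TxtTokenizer = Tokenizer.for_extension('.txt')
    tokenizer = TxtTokenizer(
        language=pycountry.languages.get(name='English'),
        dehyphenate=True,
    )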

abstract tokenize(file, storageconfig)[source]

Generate tokens for the given document.

Parameters
  • file (Path) – The document file to tokenize.

  • storageconfig – Storage configuration (database or filesystem) for the resulting Tokens.

Return type

TokenList

Returns

The Tokens generated from the document.
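
A sketch of end-to-end use, assuming the tokenizer instance from the for_extension() example above and a storageconfig object whose construction is not shown here:

    from pathlib import Path

    tokens = tokenizer.tokenize(Path('document.txt'), storageconfig)
    print(len(tokens), 'tokens extracted')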

abstract static apply(original, tokens, corrected)[source]

abstract static crop_tokens(original, config, tokens, edge_left=None, edge_right=None)[source]

class CorrectOCR.tokens.TokenList(config, docid=None, tokens=None)[source]

Bases: collections.abc.MutableSequence

Abstract base class. TokenList subclasses handle storing and loading a document's Tokens via the configured storage backend (filesystem or database).

static register(storagetype)[source]

Decorator which registers a TokenList subclass with the base class.

Parameters

storagetype (str) – Either fs (filesystem) or db (database).

static new(config, docid=None, tokens=None)[source]

Create a TokenList of the subclass matching the configured storage type.

Return type

TokenList

static for_type(type)[source]

Obtain the TokenList subclass registered for the given storage type.

Return type

ABCMeta

insert(key, value)[source]

S.insert(index, value) – insert value before index

static exists(config, docid)[source]

Check whether stored tokens exist for the given docid.

Return type

bool

abstract load(docid)[source]

Load the Tokens for the given docid from storage.

abstract save(token=None)[source]

Save the Tokens to storage.

Parameters

token – An optional single Token to save.

property corrected_count

The number of tokens that have a gold correction.

property discarded_count

The number of tokens that have been discarded.

random_token_index(has_gold=False, is_discarded=False)[source]

Return the index of a random token, optionally restricted to tokens that have a gold correction and/or are discarded.

random_token(has_gold=False, is_discarded=False)[source]

Return a random token, optionally restricted to tokens that have a gold correction and/or are discarded.
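
A sketch of common operations, assuming a config object for the chosen storage backend (its construction is not shown, and the exact creation/loading flow may differ by backend):

    from CorrectOCR.tokens import TokenList

    if TokenList.exists(config, 'doc001'):
        tokens = TokenList.new(config, docid='doc001')
        tokens.load('doc001')
        print(tokens.corrected_count, 'corrected of', len(tokens))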
class CorrectOCR.tokens.KBestItem(candidate, probability)[source]

Bases: tuple

A single k-best suggestion, pairing a candidate spelling with its estimated probability.

property candidate

Alias for field number 0: the suggested spelling.

property probability

Alias for field number 1: the estimated probability of the candidate.
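
For example:

    from CorrectOCR.tokens import KBestItem

    item = KBestItem(candidate='example', probability=0.87)
    assert item.candidate == 'example'
    assert item.probability == 0.87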

CorrectOCR.tokens.tokenize_str(data, language='english')[source]

Tokenize a string of text into a list of token strings.

Parameters
  • data (str) – The text to tokenize.

  • language – Name of the language whose tokenizer should be used.

Return type

List[str]
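
A minimal usage sketch:

    from CorrectOCR.tokens import tokenize_str

    words = tokenize_str('The quick brown fox.', language='english')
    # Expected shape: a list of word and punctuation strings, e.g.
    # ['The', 'quick', 'brown', 'fox', '.'] (exact splitting depends
    # on the underlying tokenizer).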