CorrectOCR.tokens module

class CorrectOCR.tokens.Token(original, docid, index)[source]

Bases: abc.ABC

Abstract base class. Tokens handle single words. …

Parameters
  • original (str) – Original spelling of the token.

  • docid (str) – The doc with which the Token is associated.

  • index – The placement of the Token in the doc.

static register(cls)[source]

Decorator which registers a Token subclass with the base class.

Parameters

cls – Token subclass
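The documentation does not show the registry internals; as a rough sketch of how such a registration decorator is typically wired up (the class names below are illustrative stand-ins, not the real CorrectOCR classes):

```python
from abc import ABC

class Token(ABC):
    """Stand-in; the real base class lives in CorrectOCR.tokens."""
    subclasses = []

    @staticmethod
    def register(cls):
        """Decorator which records a Token subclass on the base class."""
        Token.subclasses.append(cls)
        return cls

@Token.register
class HypotheticalToken(Token):
    pass

# HypotheticalToken is now discoverable via Token.subclasses
```

Because register returns the class unchanged, the decorator can be stacked with others without affecting the subclass itself.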

docid

The doc with which the Token is associated.

index

The placement of the Token in the doc.

gold

The corrected spelling of the Token, if known.

bin: Optional[CorrectOCR.heuristics.Bin]

Heuristics bin.

kbest: DefaultDict[int, CorrectOCR.model.kbest.KBestItem]

Dictionary of k-best suggestions for the Token. They are keyed with a numerical index starting at 1, and the values are instances of KBestItem.
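To illustrate the 1-indexed mapping described above: KBestItem's real fields are defined in CorrectOCR.model.kbest and are not shown here, so the NamedTuple below is a hypothetical stand-in.

```python
from typing import NamedTuple

class KBestItem(NamedTuple):  # hypothetical stand-in, not the real class
    candidate: str
    probability: float

kbest = {
    1: KBestItem('word', 0.92),  # keys start at 1, not 0
    2: KBestItem('ward', 0.05),
}

best = kbest[1]  # the top-ranked suggestion
```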

heuristic: Optional[str]

The heuristic that was determined by the bin.

selection: Any

The selected automatic correction for the heuristic.

is_hyphenated
is_discarded

(documented in @property methods below)

annotations

A list of arbitrary key/value pairs with information about the annotations.

has_error

Whether the token has an unhandled error

last_modified

When one of the gold, is_hyphenated, is_discarded, or has_error properties was last updated.

cached_image_path

Where the image file should be cached. It is not guaranteed to exist, but can be generated via extract_image().

abstract property token_info: Any
Return type

Any

abstract property page: int

The page of the document on which the token is located.

May not be applicable for all token types.

Return type

int

Returns

The page number.

abstract property frame: Tuple[int, int, int, int]

The coordinates of the token’s location on the page.

Takes the form [x0, y0, x1, y1] where (x0, y0) is the top-left corner, and (x1, y1) is the bottom-right corner.

May not be applicable for all token types.

Returns

The frame coordinates.
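Given the [x0, y0, x1, y1] convention above, the token's width and height follow directly (coordinates below are illustrative):

```python
# Frame is [x0, y0, x1, y1]: (x0, y0) top-left, (x1, y1) bottom-right.
frame = (102, 340, 245, 368)  # illustrative coordinates
x0, y0, x1, y1 = frame
width = x1 - x0   # 143
height = y1 - y0  # 28
```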

property k: int

The number of k-best suggestions for the Token.

Return type

int

is_punctuation()[source]

Is the Token purely punctuation?

Return type

bool

is_numeric()[source]

Is the Token purely numeric?

Return type

bool
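As a hedged sketch of what these purity checks might look like (the real implementations in CorrectOCR may differ, e.g. in Unicode handling):

```python
import string

def is_punctuation(original: str) -> bool:
    """True if the token consists solely of punctuation characters."""
    return len(original) > 0 and all(c in string.punctuation for c in original)

def is_numeric(original: str) -> bool:
    """True if the token consists solely of numeric characters."""
    return original.isnumeric()

is_punctuation('!?')  # True
is_numeric('1842')    # True
is_numeric('18th')    # False -- mixed digits and letters
```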

classmethod from_dict(d)[source]

Initialize and return a new Token with values from a dictionary.

Parameters

d (dict) – A dictionary of properties for the Token

Return type

Token

drop_cached_image()[source]
extract_image(workspace, highlight_word=True, left=300, right=300, top=15, bottom=15, force=False)[source]
Return type

Tuple[Path, Any]

class CorrectOCR.tokens.Tokenizer(language)[source]

Bases: abc.ABC

Abstract base class. The Tokenizer subclasses handle extracting Token instances from a document.

Parameters

language (pycountry.Language) – The language to use for tokenization (for example, the .txt tokenizer internally uses nltk, whose tokenizers function best with a language parameter).

static register(extensions)[source]

Decorator which registers a Tokenizer subclass with the base class.

Parameters

extensions (List[str]) – List of extensions that the subclass will handle

static for_extension(ext)[source]

Obtain the suitable subclass for the given extension. Currently, Tokenizers are provided for the following extensions:

  • .txt – plain old text.

  • .pdf – assumes the PDF contains images and OCRed text.

  • .tiff – will run OCR on the image and generate a PDF.

  • .png – will run OCR on the image and generate a PDF.

Parameters

ext (str) – Filename extension (including leading period).

Return type

ABCMeta

Returns

A Tokenizer subclass.
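The extension-to-subclass lookup suggests a registry populated by register() and queried by for_extension(). A rough sketch of that pattern, with illustrative stand-in classes (the real CorrectOCR internals are not reproduced here):

```python
from abc import ABC
from typing import Dict, List, Type

class Tokenizer(ABC):
    """Stand-in base class; the real one lives in CorrectOCR.tokens."""
    _registry: Dict[str, Type['Tokenizer']] = {}

    @staticmethod
    def register(extensions: List[str]):
        """Decorator: map each filename extension to the subclass."""
        def decorator(cls):
            for ext in extensions:
                Tokenizer._registry[ext] = cls
            return cls
        return decorator

    @staticmethod
    def for_extension(ext: str) -> Type['Tokenizer']:
        """Look up the subclass registered for the given extension."""
        return Tokenizer._registry[ext]

@Tokenizer.register(['.txt'])
class HypotheticalTextTokenizer(Tokenizer):
    pass

cls = Tokenizer.for_extension('.txt')  # note the leading period
```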

abstract tokenize(file, storageconfig)[source]

Generate tokens for the given document.

Parameters
  • file (Path) – A given document.

  • storageconfig – Storage configuration (database, filesystem) for the resulting Tokens.

Return type

TokenList

abstract static apply(original, tokens, outfile, highlight=False)[source]
abstract static crop_tokens(original, config, tokens, edge_left=None, edge_right=None)[source]
class CorrectOCR.tokens.TokenList(config, docid=None, tokens=None)[source]

Bases: collections.abc.MutableSequence

static register(storagetype)[source]

Decorator which registers a TokenList subclass with the base class.

Parameters

storagetype (str) – fs or db

static new(config, docid=None, tokens=None)[source]
Return type

TokenList

static for_type(type)[source]
Return type

ABCMeta

insert(key, value)[source]

S.insert(index, value) – insert value before index

abstract load()[source]
abstract save(token=None)[source]
preload()[source]
flush()[source]
property stats
classmethod validate_stats(docid, stats)[source]
property consolidated: Tuple[str, str, Token]

A consolidated iterator of tokens, where discarded tokens are skipped, and hyphenated original/gold are included.

Returns

original, gold, token

Return type

Tuple[str, str, Token]
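One plausible reading of the skipping/merging behaviour described above, sketched with SimpleNamespace objects standing in for real Tokens (the exact hyphen-joining rules of the real property may differ):

```python
from types import SimpleNamespace as T

# Hypothetical tokens; SimpleNamespace stands in for real Token objects.
tokens = [
    T(original='exam-', gold='exam-', is_discarded=False, is_hyphenated=True),
    T(original='ple', gold='ple', is_discarded=False, is_hyphenated=False),
    T(original='|', gold='', is_discarded=True, is_hyphenated=False),
]

def consolidated(tokens):
    """Yield (original, gold, token), skipping discarded tokens and
    folding a hyphenated token together with the one that follows it."""
    it = iter(tokens)
    for token in it:
        if token.is_discarded:
            continue
        original, gold = token.original, token.gold
        if token.is_hyphenated:
            second = next(it)  # consume the second half of the word
            original = original[:-1] + second.original  # drop the hyphen
            gold = gold[:-1] + second.gold
        yield original, gold, token

result = list(consolidated(tokens))  # one merged ('example', 'example', …)
```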

property server_ready
random_token_index(has_gold=False, is_discarded=False)[source]
random_token(has_gold=False, is_discarded=False)[source]
property overview

Generator that returns a fast overview of the TokenList.

Each item is a dictionary containing the following keys:

  • doc_id: The document

  • doc_index: The Token’s placement in the document

  • string: TODO

  • is_corrected: Whether the Token has a set gold property

  • is_discarded: Whether the Token is marked as discarded
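Per the key list above, a single overview item would take this shape (values are illustrative):

```python
# Shape of one item yielded by the overview generator.
item = {
    'doc_id': 'doc0001',      # the document
    'doc_index': 42,          # the Token's placement in the document
    'string': 'exampel',      # see TODO in the docs
    'is_corrected': True,     # gold property is set
    'is_discarded': False,    # not marked as discarded
}
```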

property last_modified
dehyphenate()[source]
CorrectOCR.tokens.tokenize_str(data, language='english')[source]
Return type

List[str]
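A hedged sketch of the expected behaviour: the real function delegates to language-aware tokenization (per the Tokenizer docs above, nltk for plain text), so this regex split is only a stand-in that shows the shape of the result.

```python
import re
from typing import List

def tokenize_str(data: str, language: str = 'english') -> List[str]:
    """Stand-in: split words from punctuation. The real CorrectOCR
    function uses a language-aware tokenizer instead."""
    return re.findall(r"\w+|[^\w\s]", data)

tokenize_str('Dr. Smith arrived.')  # ['Dr', '.', 'Smith', 'arrived', '.']
```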