CorrectOCR.model module

class CorrectOCR.model.HMM(path, multichars=None, dictionary=None)

Bases: object

Parameters
  • path (Path) – Path for loading and saving.

  • multichars – A dictionary of possible multicharacter substitutions (e.g. ‘cr’: ‘æ’ or vice versa).

  • dictionary (Optional[Dictionary]) – The dictionary against which to check validity.

property init

Initial probabilities.

Return type

DefaultDict[str, float]

property tran

Transition probabilities.

Return type

DefaultDict[str, DefaultDict[str, float]]

property emis

Emission probabilities.

Return type

DefaultDict[str, DefaultDict[str, float]]
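
To illustrate the shape of these three properties, the following minimal sketch builds toy tables with the same types. The values and the two-character alphabet are made up for the example and are not taken from CorrectOCR.

```python
from collections import defaultdict

# Toy values for illustration only; real parameters come from a trained model.
init = defaultdict(float, {"a": 0.6, "b": 0.4})

# tran[previous_state][next_state] -> probability
tran = defaultdict(lambda: defaultdict(float))
tran["a"].update({"a": 0.7, "b": 0.3})
tran["b"].update({"a": 0.2, "b": 0.8})

# emis[hidden_state][observed_character] -> probability
emis = defaultdict(lambda: defaultdict(float))
emis["a"].update({"a": 0.9, "o": 0.1})
emis["b"].update({"b": 0.9, "h": 0.1})

# Looking up an unseen pair yields 0.0 instead of raising KeyError.
# Note that the lookup also inserts the missing key with that default.
unseen = tran["a"]["z"]
```

Because a defaultdict inserts keys on lookup, code that only wants to read a value may prefer `.get()`.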

save(path=None)

Save the HMM parameters.

Parameters

path (Optional[Path]) – Optional new path to save to.

is_valid()

Verify that parameters are valid (i.e. the keys in init/tran/emis match).

Return type

bool
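
One way to read "the keys match" is that all three tables are defined over the same set of hidden states. The sketch below checks exactly that. The function name `keys_match` and the toy tables are hypothetical, and the check performed by `is_valid()` itself may differ.

```python
def keys_match(init, tran, emis):
    """Return True if init, tran and emis share the same set of states."""
    states = set(init)
    if set(tran) != states or set(emis) != states:
        return False
    # Every transition target must itself be a known state.
    return all(set(row) <= states for row in tran.values())


# Toy tables for illustration only.
init = {"a": 0.6, "b": 0.4}
tran = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.2, "b": 0.8}}
emis = {"a": {"a": 0.9, "o": 0.1}, "b": {"b": 0.9, "h": 0.1}}

consistent = keys_match(init, tran, emis)
# Dropping the emission row for "b" makes the tables inconsistent.
broken = keys_match(init, tran, {"a": emis["a"]})
```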

viterbi(char_seq)

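Decodes a character sequence with the Viterbi algorithm, returning the most likely sequence of hidden states under the model's init, tran and emis parameters.

Parameters

char_seq (Sequence[str]) – The observed characters.

Return type

str

Returns

The most likely sequence of hidden states, joined into a string.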
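
For readers unfamiliar with the technique, here is a minimal sketch of textbook Viterbi decoding over dictionary-based tables like the ones above. It is not the CorrectOCR implementation: the function name `viterbi_sketch`, the `floor` argument and the toy tables are assumptions made for the example, and hidden states are assumed to be single characters.

```python
import math


def viterbi_sketch(char_seq, init, tran, emis, floor=1e-12):
    """Return the most likely hidden-state string for the observed sequence."""
    if not char_seq:
        return ""
    states = list(init)

    def lp(p):
        # Log-probability, floored so that zero probabilities stay finite.
        return math.log(max(p, floor))

    # score[s]: best log-probability of any path that ends in state s.
    score = {
        s: lp(init.get(s, 0.0)) + lp(emis.get(s, {}).get(char_seq[0], 0.0))
        for s in states
    }
    backpointers = []
    for obs in char_seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor of s given the scores so far.
            prev = max(
                states, key=lambda p: score[p] + lp(tran.get(p, {}).get(s, 0.0))
            )
            new_score[s] = (
                score[prev]
                + lp(tran.get(prev, {}).get(s, 0.0))
                + lp(emis.get(s, {}).get(obs, 0.0))
            )
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score

    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Toy tables for illustration only: "o" is a likely misreading of "a",
# and "h" a likely misreading of "b".
init = {"a": 0.5, "b": 0.5}
tran = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
emis = {"a": {"a": 0.9, "o": 0.1}, "b": {"b": 0.9, "h": 0.1}}

decoded = viterbi_sketch("oh", init, tran, emis)
```

Working in log space avoids numeric underflow on long sequences, which is why the sketch adds log-probabilities rather than multiplying probabilities.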

kbest_for_word(word, k)

Generates k-best correction candidates for a single word.

Parameters
  • word (str) – The word for which to generate candidates.

  • k (int) – How many candidates to generate.

Return type

DefaultDict[int, KBestItem]

Returns

A dictionary of ranked candidates, keyed by rank from 1 to k.
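
The idea of a k-best list can be sketched with a Viterbi variant that keeps the k best partial paths per state instead of only one. The code below is an illustration under stated assumptions, not the library's method: `kbest_sketch` is a hypothetical name, hidden states are assumed to be single characters, and plain `(candidate, probability)` tuples stand in for `KBestItem`, whose fields are not described here.

```python
import heapq
import math


def kbest_sketch(word, k, init, tran, emis, floor=1e-12):
    """Return {rank: (candidate, probability)} for the k most likely decodings."""
    if not word or k < 1:
        return {}
    states = list(init)

    def lp(p):
        # Log-probability, floored so that zero probabilities stay finite.
        return math.log(max(p, floor))

    # paths[s]: up to k (log_probability, text) pairs for paths ending in s.
    paths = {
        s: [(lp(init.get(s, 0.0)) + lp(emis.get(s, {}).get(word[0], 0.0)), s)]
        for s in states
    }
    for obs in word[1:]:
        new_paths = {}
        for s in states:
            emit = lp(emis.get(s, {}).get(obs, 0.0))
            # Extend every stored path with s, then keep the k best.
            extended = [
                (score + lp(tran.get(p, {}).get(s, 0.0)) + emit, text + s)
                for p in states
                for score, text in paths[p]
            ]
            new_paths[s] = heapq.nlargest(k, extended)
        paths = new_paths

    # Merge the per-state lists and rank the overall k best from 1 to k.
    best = heapq.nlargest(k, (item for items in paths.values() for item in items))
    return {
        rank: (text, math.exp(score))
        for rank, (score, text) in enumerate(best, start=1)
    }


# Toy tables for illustration only.
init = {"a": 0.5, "b": 0.5}
tran = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
emis = {"a": {"a": 0.9, "o": 0.1}, "b": {"b": 0.9, "h": 0.1}}

candidates = kbest_sketch("oh", 2, init, tran, emis)
```

Keeping k paths per state is sufficient for an exact result, since any of the overall k best paths can only extend one of the k best paths into its previous state.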

generate_kbest(tokens, k=4, force=False)

Generates k-best correction candidates for a list of Tokens and adds them to each token.

Parameters
  • tokens (TokenList) – List of tokens.

  • k (int) – How many candidates to generate.

class CorrectOCR.model.HMMBuilder(dictionary, smoothingParameter, characterSet, readCounts, remove_chars, gold_words)

Bases: object

Calculates parameters for an HMM based on the input. They can be accessed via the three properties init, tran and emis.

Parameters
  • dictionary (Dictionary) – The dictionary to use for generating probabilities.

  • smoothingParameter (float) – Lower bound for probabilities.

  • characterSet – Set of required characters for the final HMM.

  • readCounts – See Aligner.

  • remove_chars (List[str]) – List of characters to remove from the final HMM.

  • gold_words (List[str]) – List of known correct words.

emis: DefaultDict[str, float]

Emission probabilities.

init: DefaultDict[str, float]

Initial probabilities.

tran: DefaultDict[str, DefaultDict[str, float]]

Transition probabilities.
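
To show what a lower bound for probabilities can look like in practice, the sketch below turns raw counts into a probability table, floors each relative frequency at a smoothing value, and renormalises. This is one common scheme and only an assumption about how `smoothingParameter` might be applied; the function name, its arguments and the toy counts are hypothetical.

```python
from collections import defaultdict


def smoothed_probabilities(counts, outcomes, smoothing_parameter):
    """Floor relative frequencies at smoothing_parameter, then renormalise."""
    if smoothing_parameter <= 0:
        raise ValueError("smoothing_parameter must be positive")
    total = sum(counts.get(o, 0) for o in outcomes)
    # Relative frequency of each outcome (0.0 everywhere if nothing was counted).
    raw = {o: (counts.get(o, 0) / total if total else 0.0) for o in outcomes}
    # Apply the lower bound so that no outcome has zero probability.
    floored = {o: max(p, smoothing_parameter) for o, p in raw.items()}
    # Renormalise so the probabilities sum to 1 again.
    norm = sum(floored.values())
    return defaultdict(float, {o: p / norm for o, p in floored.items()})


# Toy counts for illustration only: "c" was never observed.
probs = smoothed_probabilities({"a": 3, "b": 1}, "abc", 0.01)
```

After renormalisation the smallest value can end up slightly below the bound itself, so the parameter acts as a floor before normalisation rather than a strict guarantee afterwards.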