Commands

Commands and their arguments are called directly on the module, like so:

python -m CorrectOCR [command] [args...]

The following commands are available:

  • build_dictionary creates a dictionary. Input files can be .pdf, .txt, or .xml (in TEI format); they may also be contained in .zip files.

    • The --corpusPath option specifies a directory of files.

    • The --corpusFile option specifies a file containing paths and URLs. One such file for a dictionary covering 1800–1948 Danish is provided under resources/.

    • The --clear option clears the dictionary before adding words (the file is backed up first).

    It is strongly recommended to generate a large dictionary for best performance.
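
    For example, a dictionary could be built from a local directory of source texts (the directory name here is illustrative):

        python -m CorrectOCR build_dictionary --corpusPath ./dictionary_sources/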

  • align aligns a pair of (original, gold) files in order to determine which characters and words were misread in the original and corrected in the gold.

    • The --fileid option specifies a single pair of files to align.

    • The --all option aligns all available pairs. Can be combined with --exclude to skip specific files.
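
    For example, to align every available (original, gold) pair:

        python -m CorrectOCR align --all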

  • build_model uses the alignments to create parameters for the HMM.

    • The --smoothingParameter option can be adjusted as needed.
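
    Assuming the default smoothing parameter is acceptable, it can be run without options:

        python -m CorrectOCR build_model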

  • add copies or downloads files to the workspace. A single file may be provided directly, or the --documents option may be used to provide a list of files.

    • The --documents option specifies a file containing paths and URLs.

    • The --max_count option specifies the maximum number of files to add.

    • The --prepare_step option allows the automatic preparation of the files as they are added. See below.
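
    For example, to add at most 10 files from a list and tokenize them as they are added (the filename is illustrative, and it is assumed that --prepare_step accepts one of the step names listed under prepare below):

        python -m CorrectOCR add --documents documents.txt --max_count 10 --prepare_step tokenize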

  • prepare tokenizes and prepares texts for correction.

    • The --fileid option specifies which file to tokenize.

    • The --all option tokenizes all available texts. Can be combined with --exclude to skip specific files.

    • The --step option specifies how many of the processing steps to take. The default is to take all steps.

      • tokenize simply splits the text into tokens (words).

      • align aligns tokens with gold versions, if these exist.

      • kbest calculates k-best correction candidates for each token via the HMM.

      • bin sorts the tokens into bins according to the heuristics below.

      Each step includes the previous ones, and will save intermediate information about each token to CSV files or a database.
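
    For example, assuming --step accepts one of the step names above, all texts could be processed up to and including the kbest step with:

        python -m CorrectOCR prepare --all --step kbest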

  • stats is used to configure which decisions the program should make about each bin of tokens:

    • --make_report generates a statistical report on the tokens, e.g. whether the originals and k-best candidates are equal, whether they are in the dictionary, etc. This report can then be inspected and annotated with the desired decision for each bin.

    • --make_settings creates correction settings based on the annotated report.
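
    A typical sequence would thus be, with a manual annotation pass on the report in between:

        python -m CorrectOCR stats --make_report
        python -m CorrectOCR stats --make_settings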

  • correct uses the settings to sort the tokens into bins and makes automated decisions as configured.

    • The --fileid option specifies which file to correct.

    There are three ways to run corrections:

    • --interactive runs an interactive correction CLI for the remaining undecided tokens (see Correction Interface below).

    • --apply takes a path argument to an edited token CSV file and applies the corrections therein.

    • --autocorrect applies available corrections as configured in the correction settings (i.e. any heuristic bins not marked for human annotation).
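
    For example, to apply the configured automatic corrections to a single file (the file ID is illustrative):

        python -m CorrectOCR correct --fileid 0042 --autocorrect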

  • index finds specified terms for use in index-generation.

    • The --fileid option specifies a single file for which to generate an index.

    • The --all option generates indices for all available files. Can be combined with --exclude to skip specific files.

    • The --termFile option specifies a text file containing one term per line, which will be matched against the tokens. The option may be repeated; each filename (without extension) is used as the marker for its matched terms.

    • The --highlight option will create a copy of the input files with highlighted words (only available for PDFs).

    • The --autocorrect option applies available corrections prior to search/highlighting, as above.
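
    For example, to match terms from two term files against all available files (the filenames are illustrative):

        python -m CorrectOCR index --all --termFile persons.txt --termFile places.txt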

  • server starts a simple Flask backend server that provides JSON descriptions and .png images of tokens, and accepts POST requests to update tokens with corrections.
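
    Assuming no further options are required, a minimal invocation is:

        python -m CorrectOCR server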

  • cleanup deletes the backup files in the training directory.

    • The --dryrun option simply lists the files without actually deleting them.

    • The --full option also deletes the current files (i.e. those without .nnn. in their suffix).
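
    For example, to list the backup files that would be deleted, without deleting anything:

        python -m CorrectOCR cleanup --dryrun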