
Global Arguments

If the global arguments are not provided on the command line, CorrectOCR.ini and environment variables are checked (see CorrectOCR.config).
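The fallback described above can be pictured with a small sketch. It is not the code in CorrectOCR.config: the section name `workspace`, the `CORRECTOCR_` variable prefix, and the choice to consult the INI file before the environment are all assumptions made for illustration.

```python
import configparser
import os


def resolve_setting(name, cli_value, ini_text, environ=None,
                    section="workspace", env_prefix="CORRECTOCR_"):
    """Return the first value found: command line, INI text, environment.

    The section name, the prefix and the lookup order are illustrative
    assumptions, not documented CorrectOCR behavior.
    """
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep camelCase keys such as rootPath
    parser.read_string(ini_text)
    if parser.has_option(section, name):
        return parser.get(section, name)
    return environ.get(env_prefix + name.upper())


ini = "[workspace]\nrootPath = /data/ocr\n"
print(resolve_setting("rootPath", None, ini, environ={}))
print(resolve_setting("rootPath", "/tmp/ws", ini, environ={}))
```

A value given on the command line wins in this sketch; the other sources are only consulted when it is absent.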


These arguments configure the Workspace, i.e. where the documents are located.

usage: python -m CorrectOCR [-h] [--rootPath PATH] [--originalPath PATH]
                            [--goldPath PATH] [--trainingPath PATH]
                            [--docInfoBaseURL URL] [--nheaderlines N]
                            [--language LANGUAGE]
                            [--combine_hyphenated_images [COMBINE_HYPHENATED_IMAGES]]

Named Arguments


--rootPath
Path to root of workspace

--originalPath
Path to directory of original, uncorrected docs

--goldPath
Path to directory of known correct “gold” docs

--trainingPath
Path for generated training files

--docInfoBaseURL
Base URL that serves info about documents

--nheaderlines
Number of lines in corpus headers

Default: 0

--language
Language of text

--combine_hyphenated_images
Generate joined images for hyphenated tokens


These arguments configure the ResourceManager, e.g. dictionary, model, etc.

usage: python -m CorrectOCR [-h] [--resourceRootPath PATH]
                            [--hmmParamsFile FILE] [--reportFile FILE]
                            [--heuristicSettingsFile FILE]
                            [--multiCharacterErrorFile FILE]
                            [--memoizedCorrectionsFile FILE]
                            [--correctionTrackingFile FILE]
                            [--dictionaryFile FILE] [--ignoreCase]

Named Arguments


--resourceRootPath
Path to root of resources

--hmmParamsFile
Path to HMM parameters (generated from alignment docs via the model --build command)

--reportFile
Path to output heuristics report (TXT file)

--heuristicSettingsFile
Path to heuristics settings (generated via the stats --make_settings command)

--multiCharacterErrorFile
Path to output multi-character error file (JSON format)

--memoizedCorrectionsFile
Path to memoizations of corrections.

--correctionTrackingFile
Path to correction tracking.

--dictionaryFile
Path to dictionary file

--ignoreCase
Use case-insensitive dictionary comparisons

Default: False


These arguments configure the TokenList backend storage.

usage: python -m CorrectOCR [-h] [--type {db,fs}] [--db_driver DB_DRIVER]
                            [--db_host DB_HOST] [--db_user DB_USER]
                            [--db_password DB_PASSWORD] [--db DB]

Named Arguments


--type
Possible choices: db, fs

Storage type

--db_driver
Database driver

--db_host
Database hostname

--db_user
Database username

--db_password
Database user password

--db
Database name


Correct OCR

usage: python -m CorrectOCR [-h] [-k K] [--force]
                            [--loglevel {CRITICAL,FATAL,ERROR,WARNING,INFO,DEBUG}]
                            [--dehyphenate DEHYPHENATE]

Positional Arguments


Possible choices: dictionary, align, model, add, prepare, crop, stats, correct, index, cleanup, server

Choose command

Named Arguments


-k
Number of k-best candidates to use for tokens

Default: 4

--force
Force command to run

Default: False

--loglevel
Possible choices: CRITICAL, FATAL, ERROR, WARNING, INFO, DEBUG

Log level

Default: “INFO”

--dehyphenate
Automatically mark new tokens as hyphenated if they end with a dash
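As an illustration of how the general options behave, the following argparse sketch mirrors the usage line above. It is a stand-in, not the CorrectOCR parser: that -k is parsed as an integer and that the log level names are handed to the standard logging module are assumptions. In the standard library, FATAL is an alias of CRITICAL.

```python
import argparse
import logging

parser = argparse.ArgumentParser(prog="python -m CorrectOCR")
parser.add_argument("-k", type=int, default=4)       # assumed to be an int
parser.add_argument("--force", action="store_true")
parser.add_argument(
    "--loglevel",
    choices=["CRITICAL", "FATAL", "ERROR", "WARNING", "INFO", "DEBUG"],
    default="INFO",
)

args = parser.parse_args(["-k", "6", "--loglevel", "FATAL"])
# Look up the numeric level the stdlib associates with the chosen name.
level = getattr(logging, args.loglevel)
print(args.k, args.force, level == logging.CRITICAL)
```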



Dictionary-related commands.

python -m CorrectOCR dictionary [-h] {build,check} ...

Build dictionary.

Input files can be .pdf, .txt, or .xml (in TEI format). They may also be packaged in .zip files.

A corpusFile for 1800–1948 Danish is available in the workspace/resources/ directory.

It is strongly recommended to generate a large dictionary for best performance.

See CorrectOCR.dictionary for further details.

python -m CorrectOCR dictionary build [-h] [--corpusPath CORPUSPATH]
                                      [--corpusFile CORPUSFILE]
                                      [--add_annotator_gold] [--clear]
Named Arguments

--corpusPath
Directory of files to split into words and add to dictionary

--corpusFile
File containing paths and URLs to use as corpus (TXT format)

--add_annotator_gold
Add gold words from annotated tokens

Default: False

--clear
Clear the dictionary before adding words

Default: False



python -m CorrectOCR dictionary check [-h] [words [words ...]]
Positional Arguments

words
Words to check in dictionary
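The build and check steps can be pictured with a minimal sketch. It is not the code in CorrectOCR.dictionary: the regular-expression word splitting and the way case is folded are simplifying assumptions, and the sample sentences are made up.

```python
import re


def split_words(text):
    """Split text into words; a simple stand-in for real tokenization."""
    return re.findall(r"\w+", text)


def build_dictionary(texts, ignore_case=False):
    """Collect every word from the given texts into a set."""
    words = set()
    for text in texts:
        for word in split_words(text):
            words.add(word.lower() if ignore_case else word)
    return words


def check(words, dictionary, ignore_case=False):
    """Report for each word whether it is in the dictionary."""
    return {w: (w.lower() if ignore_case else w) in dictionary for w in words}


corpus = ["Kongen rejste til Roskilde.", "Byen Roskilde ligger ved fjorden."]
dictionary = build_dictionary(corpus, ignore_case=True)
print(check(["roskilde", "Fjorden", "Odense"], dictionary, ignore_case=True))
```

A larger corpus gives the lookup more words to recognize, which is why a large dictionary is recommended above.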


Create alignments.

The tokens of each pair of (original, gold) files are aligned in order to determine which characters and words were misread in the original and corrected in the gold.

These alignments can be used to train the model.

See CorrectOCR.aligner for further details.

python -m CorrectOCR align [-h]
                           (--docid DOCID | --docids DOCIDS [DOCIDS ...] | --all)
                           [--exclude EXCLUDE]
Named Arguments

--docid
Input document ID (filename without path or extension)

--docids
Input multiple document IDs

--all
Align all original/gold pairs

Default: False

--exclude
Doc ID to exclude (can be specified multiple times)

Default: []
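To make the idea of an alignment concrete, here is a sketch that compares one original token with its gold counterpart and lists the character spans that differ. It uses difflib from the standard library as a stand-in; CorrectOCR.aligner may work differently, and the sample tokens are invented.

```python
from difflib import SequenceMatcher


def char_confusions(original, gold):
    """List (misread, corrected) character spans between two tokens."""
    matcher = SequenceMatcher(None, original, gold, autojunk=False)
    return [
        (original[i1:i2], gold[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(char_confusions("rnodern", "modern"))
print(char_confusions("tbe", "the"))
```

Pairs such as ("rn", "m") are the kind of evidence a model can be trained on.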


Build model.

This is done with the aligned original/gold documents. If none exist, an attempt will be made to create them.

The result is an HMM as described in the original paper.

See CorrectOCR.model for further details.

python -m CorrectOCR model [-h] (--build | --get_kbest GET_KBEST)
                           [--smoothingParameter N[.N]] [--other OTHER]
Named Arguments

--build
Rebuild model

Default: False

--get_kbest
Get k-best for word with current model

--smoothingParameter
Smoothing parameter for HMM

Default: 0.0001

--other
Compare against other model
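What a smoothing parameter does can be shown with generic additive smoothing of character confusion counts. This is only an illustration of the concept: whether CorrectOCR.model or the original paper smooths in exactly this way is not stated here, and the function name and sample pairs are hypothetical.

```python
from collections import Counter


def emission_probabilities(pairs, alphabet, smoothing=0.0001):
    """Estimate P(read | true) from aligned (true, read) character pairs.

    The smoothing value is added to every count, so confusions that were
    never observed keep a small non-zero probability.
    """
    counts = Counter(pairs)
    size = len(alphabet)
    probs = {}
    for true_char in alphabet:
        total = sum(counts[(true_char, read_char)] for read_char in alphabet)
        for read_char in alphabet:
            probs[(true_char, read_char)] = (
                (counts[(true_char, read_char)] + smoothing)
                / (total + smoothing * size)
            )
    return probs


pairs = [("h", "h"), ("h", "h"), ("h", "b"), ("b", "b")]
probs = emission_probabilities(pairs, alphabet="bh")
print(probs[("b", "h")] > 0)  # an unseen confusion is still possible
```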


Add documents for processing

One may add a single document directly on the command line, or provide a text file containing a list of documents.

They will be copied or downloaded to the workspace/original/ folder.

See CorrectOCR.workspace.Workspace for further details.

python -m CorrectOCR add [-h] [--documentsFile DOCUMENTSFILE]
                         [--prepare_step {tokenize,align,kbest,bin,all,server}]
                         [--max_count MAX_COUNT]
Positional Arguments

Single path/URL to document

Named Arguments

--documentsFile
File containing list of files/URLs to documents

--prepare_step
Possible choices: tokenize, align, kbest, bin, all, server

Automatically prepare added documents

--max_count
Maximum number of new documents to add from --documentsFile.
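A sketch of how a documents file might be processed: one path or URL per line, capped at a maximum count, with URLs downloaded and local paths copied. The one-entry-per-line format, the helper name and the http/https test are assumptions; the actual logic lives in CorrectOCR.workspace.Workspace.

```python
from urllib.parse import urlparse


def plan_additions(lines, max_count=None):
    """Return (action, entry) pairs for up to max_count listed documents."""
    entries = [line.strip() for line in lines if line.strip()]
    if max_count is not None:
        entries = entries[:max_count]
    plan = []
    for entry in entries:
        scheme = urlparse(entry).scheme
        action = "download" if scheme in ("http", "https") else "copy"
        plan.append((action, entry))
    return plan


lines = ["https://example.org/a.pdf\n", "\n", "/data/b.txt\n", "/data/c.txt\n"]
for action, entry in plan_additions(lines, max_count=2):
    print(action, entry)
```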


Prepare text for correction.

See CorrectOCR.workspace.Document for further details on the possible steps.

python -m CorrectOCR prepare [-h]
                             (--docid DOCID | --docids DOCIDS [DOCIDS ...] | --all | --skip_done)
                             [--exclude EXCLUDE]
                             [--step {tokenize,rehyphenate,align,kbest,bin,all,server}]
                             [--autocrop] [--precache_images]
Named Arguments

--docid
Input document ID (filename without path or extension)

--docids
Input multiple document IDs

--all
Select all documents

Default: False

--skip_done
Select only unfinished documents

Default: False

--exclude
Doc ID to exclude (can be specified multiple times)

Default: []

--step
Possible choices: tokenize, rehyphenate, align, kbest, bin, all, server

Default: “all”

--autocrop
Discard tokens near page edges

Default: False

--precache_images
Create images for the server API

Default: False
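The parenthesized alternatives in the usage line mean that exactly one document selector must be given, while --exclude may be repeated. The argparse sketch below reproduces that behavior under the assumption that the real parser is built along the same lines; it is not taken from the CorrectOCR source.

```python
import argparse

parser = argparse.ArgumentParser(prog="python -m CorrectOCR prepare")
selector = parser.add_mutually_exclusive_group(required=True)
selector.add_argument("--docid")
selector.add_argument("--docids", nargs="+")
selector.add_argument("--all", action="store_true")
selector.add_argument("--skip_done", action="store_true")
# Each occurrence of --exclude adds one more ID to the list.
parser.add_argument("--exclude", action="append", default=[])
parser.add_argument(
    "--step",
    choices=["tokenize", "rehyphenate", "align", "kbest", "bin", "all", "server"],
    default="all",
)

args = parser.parse_args(["--all", "--exclude", "doc1", "--exclude", "doc2"])
print(args.exclude, args.step)
```

Combining two selectors, for example --all with --docid, makes argparse exit with a usage error.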


Mark tokens near the edges of a page as disabled.

This may be desirable for scanned documents where the OCR has picked up partial words or sentences near the page edges.

The tokens are not discarded, merely marked disabled so they don’t show up in the correction interface or generated gold files.

If neither --edge_left nor --edge_right is provided, an attempt will be made to calculate them automatically.

python -m CorrectOCR crop [-h]
                          (--docid DOCID | --docids DOCIDS [DOCIDS ...] | --all)
                          [--edge_left EDGE_LEFT] [--edge_right EDGE_RIGHT]
Named Arguments

--docid
Input document ID (filename without path or extension)

--docids
Input multiple document IDs

--all
Prepare all original/gold pairs

Default: False

--edge_left
Set left cropping edge (in pixels)

--edge_right
Set right cropping edge (in pixels)
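Marking rather than discarding can be sketched as follows. The Token class and its bounding-box fields are hypothetical, and treating a token as "near the edge" when it extends past either cropping edge is an assumption; the automatic edge calculation is not shown.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x0: float  # left side of the token's bounding box, in pixels
    x1: float  # right side of the token's bounding box, in pixels
    disabled: bool = False


def mark_edge_tokens(tokens, edge_left, edge_right):
    """Disable (not delete) tokens that extend beyond the cropping edges."""
    for token in tokens:
        if token.x0 < edge_left or token.x1 > edge_right:
            token.disabled = True
    return tokens


tokens = [Token("mar", 5, 40), Token("ginal", 100, 180), Token("cut", 560, 610)]
mark_edge_tokens(tokens, edge_left=50, edge_right=600)
print([(t.text, t.disabled) for t in tokens])
```

All three tokens remain in the list; only the flag changes.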


Calculate stats about corrected documents.

The procedure is to first generate a report that shows how many tokens have been sorted into each bin. This report can then be annotated with the desired decision for each bin, and the annotated report is used to generate settings for the heuristics.

See CorrectOCR.heuristics.Heuristics for further details.

python -m CorrectOCR stats [-h] (--make_report | --make_settings) [--rebin]
                           [--only_done ONLY_DONE]
Named Arguments

--make_report
Make heuristics statistics report from finished documents

Default: False

--make_settings
Make heuristics settings from report

Default: False

--rebin
Rerun kbest/bin steps to compare quality (will take longer)

Default: False

--only_done
Whether to include all or only fully annotated documents

Default: True
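The two-step procedure, report first and settings second, can be sketched in a few lines. The bin numbers and decision labels below are invented for illustration and do not reflect the file formats or decision codes used by CorrectOCR.heuristics.Heuristics.

```python
from collections import Counter


def make_report(token_bins):
    """Count how many tokens were sorted into each bin."""
    return dict(sorted(Counter(token_bins).items()))


def make_settings(report, decisions):
    """Keep the annotated decision for every bin that occurs in the report."""
    missing = [b for b in report if b not in decisions]
    if missing:
        raise ValueError(f"bins without a decision: {missing}")
    return {b: decisions[b] for b in report}


# Hypothetical data: the bin assigned to each token of the finished documents.
report = make_report([1, 1, 2, 3, 1, 2])
# Hypothetical annotations added to the report by hand.
decisions = {1: "keep original", 2: "use top candidate", 3: "ask annotator"}
print(report)
print(make_settings(report, decisions))
```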


Apply corrections

python -m CorrectOCR correct [-h] (--docid DOCID | --filePath FILEPATH)
                             (--interactive | --apply APPLY | --autocorrect | --gold_ready)
Named Arguments

--docid
Input document ID (filename without path or extension)

--filePath
Input file path (will be copied to originalPath directory)

--interactive
Use interactive shell to input and approve suggested corrections

Default: False

--apply
Apply externally corrected token CSV to original document

--autocorrect
Apply automatic corrections as configured in settings

Default: False

--gold_ready
Apply gold from ready document

Default: False


Create a copy with highlighted words (only available for PDFs)

Default: False


Generate index data

python -m CorrectOCR index [-h] (--docid DOCID | --filePath FILEPATH)
                           [--exclude EXCLUDE] --termFile TERMFILES
                           [--highlight] [--autocorrect]
Named Arguments

--docid
Input document ID (filename without path or extension)

--filePath
Input file path (will be copied to originalPath directory)

--exclude
Doc ID to exclude (can be specified multiple times)

Default: []

--termFile
File containing a string on each line, which will be matched against the tokens

Default: []

--highlight
Create a copy with highlighted words (only available for PDFs)

Default: False

--autocorrect
Apply automatic corrections as configured in settings

Default: False
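Matching a term file against tokens might look like the sketch below. Exact, case-sensitive comparison of whole tokens is an assumption, as are the helper names and sample data; the actual matching rules are not described here.

```python
def load_terms(lines):
    """Read one term per line, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def find_matches(tokens, terms):
    """Return (position, token) for every token that equals a term."""
    return [(i, token) for i, token in enumerate(tokens) if token in terms]


terms = load_terms(["Roskilde\n", "\n", "Odense\n"])
tokens = ["Kongen", "rejste", "til", "Roskilde", "og", "Odense"]
print(find_matches(tokens, terms))
```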


Clean up intermediate files

python -m CorrectOCR cleanup [-h] [--dryrun] [--full]
Named Arguments

--dryrun
Don't delete files, just list them

Default: False

--full
Also delete the most recent files (without .nnn. in suffix)

Default: False
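The difference between the two flags can be sketched as a file selection. That ".nnn." means a three-digit counter before the extension is an assumption, the file names are invented, and the remove callback defaults to print so the sketch never deletes anything.

```python
import re

# Assumed naming: older versions carry a three-digit counter, e.g. "a.001.csv".
NUMBERED = re.compile(r"\.\d{3}\.[^.]+$")


def select_for_cleanup(filenames, full=False):
    """Pick numbered files; with full=True also the most recent ones."""
    return [name for name in filenames if full or NUMBERED.search(name)]


def cleanup(filenames, dryrun=False, full=False, remove=print):
    """Return the selection; unless dryrun is set, pass each file to remove."""
    selected = select_for_cleanup(filenames, full=full)
    if not dryrun:
        for name in selected:
            remove(name)
    return selected


files = ["doc.tokens.csv", "doc.tokens.001.csv", "doc.tokens.002.csv"]
print(cleanup(files, dryrun=True))
print(cleanup(files, dryrun=True, full=True))
```

Passing os.remove as the callback would turn the sketch into an actual deletion, so it is left out on purpose.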


Run basic JSON-dispensing Flask server

python -m CorrectOCR server [-h] [--host HOST] [--debug] [--profile [PROFILE]]
                            [--dynamic_images DYNAMIC_IMAGES]
                            [--redirect_hyphenated REDIRECT_HYPHENATED]
Named Arguments

--host
The host address

--debug
Runs the server in debug mode (see Flask docs)

Default: False

--profile
Use Werkzeug profiler middleware

--dynamic_images
Should images be generated dynamically?

--redirect_hyphenated
Redirect requests for hyphenated tokens to “head”