Commands

Global Arguments

If the global arguments are not provided on the command line, CorrectOCR.ini and environment variables are checked (see CorrectOCR.config).
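For illustration, a CorrectOCR.ini might look like the following. The section and key names here are only assumed to mirror the command-line flags; see CorrectOCR.config for the authoritative schema.

```ini
; Hypothetical CorrectOCR.ini; section and key names assumed to mirror the flags.
[workspace]
rootPath = ./workspace
nheaderlines = 0
language = Danish

[resources]
dictionaryFile = resources/dictionary.txt
```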

Workspace

These arguments configure the Workspace, i.e. where the documents are located.

usage: python -m CorrectOCR [-h] [--rootPath PATH] [--originalPath PATH]
                            [--goldPath PATH] [--trainingPath PATH]
                            [--docInfoBaseURL URL] [--nheaderlines N]
                            [--language LANGUAGE]
                            [--combine_hyphenated_images [COMBINE_HYPHENATED_IMAGES]]

Named Arguments

--rootPath

Path to root of workspace

--originalPath

Path to directory of original, uncorrected docs

--goldPath

Path to directory of known correct “gold” docs

--trainingPath

Path for generated training files

--docInfoBaseURL

Base URL that serves info about documents

--nheaderlines

Number of lines in corpus headers

Default: 0

--language

Language of text

--combine_hyphenated_images

Generate joined images for hyphenated tokens

Resource

These arguments configure the ResourceManager, e.g. the dictionary, model, etc.

usage: python -m CorrectOCR [-h] [--resourceRootPath PATH]
                            [--hmmParamsFile FILE] [--reportFile FILE]
                            [--heuristicSettingsFile FILE]
                            [--multiCharacterErrorFile FILE]
                            [--memoizedCorrectionsFile FILE]
                            [--correctionTrackingFile FILE]
                            [--dictionaryFile FILE] [--ignoreCase]

Named Arguments

--resourceRootPath

Path to root of resources

--hmmParamsFile

Path to HMM parameters (generated from alignment docs via build_model command)

--reportFile

Path to output heuristics report (TXT file)

--heuristicSettingsFile

Path to heuristics settings (generated via make_settings command)

--multiCharacterErrorFile

Path to output multi-character error file (JSON format)

--memoizedCorrectionsFile

Path to memoizations of corrections

--correctionTrackingFile

Path to correction tracking

--dictionaryFile

Path to dictionary file

--ignoreCase

Use case insensitive dictionary comparisons

Default: False
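The effect of --ignoreCase can be sketched as follows. This is a minimal illustration of case-insensitive lookup, not CorrectOCR's actual dictionary class:

```python
# Minimal sketch of case-insensitive dictionary comparison (illustrative only;
# CorrectOCR's real dictionary implementation may differ).
class SimpleDictionary:
    def __init__(self, words, ignore_case=False):
        self.ignore_case = ignore_case
        # Normalize stored words once when ignoring case.
        self.words = {w.lower() for w in words} if ignore_case else set(words)

    def __contains__(self, word):
        # Normalize the lookup key the same way as the stored words.
        return (word.lower() if self.ignore_case else word) in self.words

d = SimpleDictionary({'Hvem', 'kan'}, ignore_case=True)
print('hvem' in d)  # matches despite the case difference
```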

Storage

These arguments configure the TokenList backend storage.

usage: python -m CorrectOCR [-h] [--type {db,fs}] [--db_driver DB_DRIVER]
                            [--db_host DB_HOST] [--db_user DB_USER]
                            [--db_password DB_PASSWORD] [--db DB]

Named Arguments

--type

Possible choices: db, fs

Storage type

--db_driver

Database driver

--db_host

Database hostname

--db_user

Database username

--db_password

Database user password

--db

Database name

Commands

Correct OCR

usage: python -m CorrectOCR [-h] [-k K] [--force]
                            [--loglevel {CRITICAL,FATAL,ERROR,WARNING,INFO,DEBUG}]
                            [--dehyphenate DEHYPHENATE]
                            {dictionary,align,model,add,prepare,crop,stats,correct,index,cleanup,server}
                            ...

Positional Arguments

command

Possible choices: dictionary, align, model, add, prepare, crop, stats, correct, index, cleanup, server

Choose command

Named Arguments

-k

Number of k-best candidates to use for tokens

Default: 4

--force

Force command to run

Default: False

--loglevel

Possible choices: CRITICAL, FATAL, ERROR, WARNING, INFO, DEBUG

Log level

Default: “INFO”

--dehyphenate

Automatically mark new tokens as hyphenated if they end with a dash
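The dash test behind --dehyphenate can be sketched as below. The exact set of dash characters is an assumption for illustration; CorrectOCR's token model defines the real check:

```python
# Sketch of the end-of-token dash test implied by --dehyphenate (illustrative;
# the character set below is assumed, not taken from CorrectOCR).
DASHES = ('-', '\u2010', '\u2013')  # ASCII hyphen, Unicode hyphen, en-dash

def is_hyphenated(token_text: str) -> bool:
    # str.endswith accepts a tuple of suffixes.
    return token_text.endswith(DASHES)

print(is_hyphenated('beskri-'))  # True: token ends with a dash
```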

Sub-commands:

dictionary

Dictionary-related commands.

python -m CorrectOCR dictionary [-h] {build,check} ...
Sub-commands:
build

Build dictionary.

Input files can be either .pdf, .txt, or .xml (in TEI format). They may be contained in .zip-files.

A corpusFile for 1800–1948 Danish is available in the workspace/resources/ directory.

It is strongly recommended to generate a large dictionary for best performance.

See CorrectOCR.dictionary for further details.
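The "split into words" step can be sketched as a simple regex tokenization. This is an illustrative assumption; CorrectOCR.dictionary defines the actual splitting rules:

```python
import re

# Sketch of splitting corpus text into dictionary words (illustrative;
# the minimum-length rule and letters-only pattern are assumptions).
def words_from_text(text: str) -> set:
    # [^\W\d_] matches Unicode letters only, so punctuation and digits are skipped.
    return {w for w in re.findall(r"[^\W\d_]+", text) if len(w) > 1}

print(words_from_text('Det var en mørk og stormfuld nat.'))
```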

python -m CorrectOCR dictionary build [-h] [--corpusPath CORPUSPATH]
                                      [--corpusFile CORPUSFILE]
                                      [--add_annotator_gold] [--clear]
Named Arguments
--corpusPath

Directory of files to split into words and add to dictionary

--corpusFile

File containing paths and URLs to use as corpus (TXT format)

--add_annotator_gold

Add gold words from annotated tokens

Default: False

--clear

Clear the dictionary before adding words

Default: False

check

Undocumented

python -m CorrectOCR dictionary check [-h] [words [words ...]]
Positional Arguments
words

Words to check in dictionary

align

Create alignments.

The tokens of each pair of (original, gold) files are aligned in order to determine which characters and words were misread in the original and corrected in the gold.

These alignments can be used to train the model.

See CorrectOCR.aligner for further details.
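The idea of pairing misread characters with their corrections can be sketched with the standard library's difflib; CorrectOCR.aligner uses its own algorithm, so this is only an illustration:

```python
import difflib

# Sketch of original/gold character alignment (illustrative; not the
# CorrectOCR.aligner implementation). Collects (misread, corrected) pairs.
def align_chars(original: str, gold: str):
    matcher = difflib.SequenceMatcher(None, original, gold)
    substitutions = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == 'replace':
            substitutions.append((original[i1:i2], gold[j1:j2]))
    return substitutions

print(align_chars('Anleclning', 'Anledning'))  # [('cl', 'd')]
```

Aggregating such pairs over many document pairs yields the confusion counts a model can be trained on.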

python -m CorrectOCR align [-h]
                           (--docid DOCID | --docids DOCIDS [DOCIDS ...] | --all)
                           [--exclude EXCLUDE]
Named Arguments
--docid

Input document ID (filename without path or extension)

--docids

Input multiple document IDs

--all

Align all original/gold pairs

Default: False

--exclude

Doc ID to exclude (can be specified multiple times)

Default: []

model

Build the model. This is done with the aligned original/gold documents. If none exist, an attempt will be made to create them.

The result is an HMM as described in the original paper.

See CorrectOCR.model for further details.
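The role of --smoothingParameter can be sketched as additive (Laplace-style) smoothing of the HMM's probability estimates, so that unseen characters keep a small nonzero probability. This is an assumption for illustration; CorrectOCR.model defines the real estimation:

```python
# Sketch of additive smoothing with a small parameter k (illustrative;
# not the CorrectOCR.model implementation).
def smoothed_probs(counts: dict, alphabet: str, k: float = 0.0001) -> dict:
    total = sum(counts.values()) + k * len(alphabet)
    # Every symbol gets at least k/total, so unseen symbols are never impossible.
    return {c: (counts.get(c, 0) + k) / total for c in alphabet}

p = smoothed_probs({'a': 9, 'b': 1}, alphabet='abc')
print(p['c'])  # tiny but nonzero
```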

python -m CorrectOCR model [-h] (--build | --get_kbest GET_KBEST)
                           [--smoothingParameter N[.N]] [--other OTHER]
Named Arguments
--build

Rebuild model

Default: False

--get_kbest

Get k-best for word with current model

--smoothingParameter

Smoothing parameter for the HMM

Default: 0.0001

--other

Compare against other model

add

Add documents for processing

One may add a single document directly on the command line, or provide a text file containing a list of documents.

They will be copied or downloaded to the workspace/original/ folder.

See CorrectOCR.workspace.Workspace for further details.

python -m CorrectOCR add [-h] [--documentsFile DOCUMENTSFILE]
                         [--prepare_step {tokenize,align,kbest,bin,all,server}]
                         [--max_count MAX_COUNT]
                         [document]
Positional Arguments
document

Single path/URL to document

Named Arguments
--documentsFile

File containing list of files/URLS to documents

--prepare_step

Possible choices: tokenize, align, kbest, bin, all, server

Automatically prepare added documents

--max_count

Maximum number of new documents to add from --documentsFile

prepare

Prepare text for correction.

See CorrectOCR.workspace.Document for further details on the possible steps.

python -m CorrectOCR prepare [-h]
                             (--docid DOCID | --docids DOCIDS [DOCIDS ...] | --all | --skip_done)
                             [--exclude EXCLUDE]
                             [--step {tokenize,rehyphenate,align,kbest,bin,all,server}]
                             [--autocrop] [--precache_images]
Named Arguments
--docid

Input document ID (filename without path or extension)

--docids

Input multiple document IDs

--all

Select all documents

Default: False

--skip_done

Select only unfinished documents

Default: False

--exclude

Doc ID to exclude (can be specified multiple times)

Default: []

--step

Possible choices: tokenize, rehyphenate, align, kbest, bin, all, server

Default: “all”

--autocrop

Discard tokens near page edges

Default: False

--precache_images

Create images for the server API

Default: False

crop

Mark tokens near the edges of a page as disabled.

This may be desirable for scanned documents where the OCR has picked up partial words or sentences near the page edges.

The tokens are not discarded, merely marked disabled so they don’t show up in the correction interface or generated gold files.

If neither --edge_left nor --edge_right is provided, an attempt will be made to calculate them automatically.
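One plausible automatic calculation is a percentile rule over token coordinates; this is an assumption for illustration, and CorrectOCR's own heuristic may differ:

```python
# Sketch of automatic crop-edge estimation (illustrative; not CorrectOCR's
# actual calculation). Takes a low and a high percentile of token x-coordinates
# as the usable text area; tokens outside it would be marked disabled.
def estimate_edges(xs, margin=0.02):
    xs = sorted(xs)
    lo = xs[int(len(xs) * margin)]
    hi = xs[int(len(xs) * (1 - margin)) - 1]
    return lo, hi

print(estimate_edges(list(range(100))))  # (2, 97)
```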

python -m CorrectOCR crop [-h]
                          (--docid DOCID | --docids DOCIDS [DOCIDS ...] | --all)
                          [--edge_left EDGE_LEFT] [--edge_right EDGE_RIGHT]
Named Arguments
--docid

Input document ID (filename without path or extension)

--docids

Input multiple document IDs

--all

Crop all documents

Default: False

--edge_left

Set left cropping edge (in pixels)

--edge_right

Set right cropping edge (in pixels)

stats

Calculate stats about corrected documents.

The procedure is to first generate a report showing how many tokens have been sorted into each bin. This report can then be annotated with the desired decision for each bin, and the annotated report used to generate settings for the heuristics.

See CorrectOCR.heuristics.Heuristics for further details.
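The report/settings workflow can be sketched as below. The decision names ('original', 'kbest', 'annotator') are assumed for illustration; CorrectOCR.heuristics.Heuristics defines the real ones:

```python
from collections import Counter

# Sketch of the bin report and annotated-settings workflow (illustrative;
# token structure and decision names are assumptions).
def bin_report(tokens):
    """Count how many tokens fell into each heuristic bin."""
    return Counter(t['bin'] for t in tokens)

def apply_settings(tokens, settings):
    """Apply the annotated per-bin decision; unknown bins go to the annotator."""
    return [(t['text'], settings.get(t['bin'], 'annotator')) for t in tokens]

tokens = [{'text': 'tbe', 'bin': 2}, {'text': 'the', 'bin': 1}]
print(bin_report(tokens))
print(apply_settings(tokens, {1: 'original', 2: 'kbest'}))
```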

python -m CorrectOCR stats [-h] (--make_report | --make_settings) [--rebin]
                           [--only_done ONLY_DONE]
Named Arguments
--make_report

Make heuristics statistics report from finished documents

Default: False

--make_settings

Make heuristics settings from report

Default: False

--rebin

Rerun kbest/bin steps to compare quality (will take longer)

Default: False

--only_done

Whether to include all or only fully annotated documents

Default: True

correct

Apply corrections

python -m CorrectOCR correct [-h] (--docid DOCID | --filePath FILEPATH)
                             (--interactive | --apply APPLY | --autocorrect | --gold_ready)
                             [--highlight]
Named Arguments
--docid

Input document ID (filename without path or extension)

--filePath

Input file path (will be copied to originalPath directory)

--interactive

Use interactive shell to input and approve suggested corrections

Default: False

--apply

Apply externally corrected token CSV to original document

--autocorrect

Apply automatic corrections as configured in settings

Default: False

--gold_ready

Apply gold from ready document

Default: False

--highlight

Create a copy with highlighted words (only available for PDFs)

Default: False

index

Generate index data

python -m CorrectOCR index [-h] (--docid DOCID | --filePath FILEPATH)
                           [--exclude EXCLUDE] --termFile TERMFILES
                           [--highlight] [--autocorrect]
Named Arguments
--docid

Input document ID (filename without path or extension)

--filePath

Input file path (will be copied to originalPath directory)

--exclude

Doc ID to exclude (can be specified multiple times)

Default: []

--termFile

File containing a string on each line, which will be matched against the tokens

Default: []

--highlight

Create a copy with highlighted words (only available for PDFs)

Default: False

--autocorrect

Apply automatic corrections as configured in settings

Default: False

cleanup

Clean up intermediate files

python -m CorrectOCR cleanup [-h] [--dryrun] [--full]
Named Arguments
--dryrun

Don't delete files, just list them

Default: False

--full

Also delete the most recent files (without .nnn. in suffix)

Default: False

server

Run basic JSON-dispensing Flask server

python -m CorrectOCR server [-h] [--host HOST] [--debug] [--profile [PROFILE]]
                            [--dynamic_images DYNAMIC_IMAGES]
                            [--redirect_hyphenated REDIRECT_HYPHENATED]
Named Arguments
--host

The host address

--debug

Runs the server in debug mode (see Flask docs)

Default: False

--profile

Use Werkzeug profiler middleware

--dynamic_images

Should images be generated dynamically?

--redirect_hyphenated

Redirect requests for hyphenated tokens to “head”