Commands

Commands and their arguments are called directly on the module, like so:

    python -m CorrectOCR [command] [args...]

The following commands are available:
`build_dictionary` creates a dictionary. Input files can be either `.pdf`, `.txt`, or `.xml` (in TEI format). They may be contained in `.zip` files.

- The `--corpusPath` option specifies a directory of files.
- The `--corpusFile` option specifies a file containing paths and URLs. One such file, for a dictionary covering 1800–1948 Danish, is provided under `resources/`.
- The `--clear` option clears the dictionary before adding words (the file is backed up first).
It is strongly recommended to generate a large dictionary for best performance.
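A hypothetical invocation (the corpus directory is a placeholder):

```shell
# build a dictionary from a directory of .pdf/.txt/.xml files,
# clearing any existing dictionary first (the old file is backed up)
python -m CorrectOCR build_dictionary --corpusPath ./corpus --clear
```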
`align` aligns a pair of (original, gold) files in order to determine which characters and words were misread in the original and corrected in the gold.

- The `--fileid` option specifies a single pair of files to align.
- The `--all` option aligns all available pairs. Can be combined with `--exclude` to skip specific files.
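Two hypothetical invocations (the file IDs are placeholders, and the exact `--exclude` value syntax may differ):

```shell
# align a single (original, gold) pair
python -m CorrectOCR align --fileid 1855_Jensen

# or align every available pair, skipping one file
python -m CorrectOCR align --all --exclude 1855_Jensen
```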
`build_model` uses the alignments to create parameters for the HMM.

- The `--smoothingParameter` option can be adjusted as needed.
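A hypothetical invocation (the smoothing value shown is an arbitrary example, not a documented default):

```shell
# build HMM parameters from the alignments, with explicit smoothing
python -m CorrectOCR build_model --smoothingParameter 0.0001
```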
`add` copies or downloads files to the workspace. One may provide a single file directly, or use the `--documents` option to provide a list of files.

- The `--documents` option specifies a file containing paths and URLs.
- The `--max_count` option specifies the maximum number of files to add.
- The `--prepare_step` option allows the automatic preparation of the files as they are added. See below.
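Hypothetical invocations (paths are placeholders; the positional-argument form and the step name passed to `--prepare_step` are assumptions based on the descriptions above):

```shell
# add a single file directly
python -m CorrectOCR add ./originals/1855_Jensen.pdf

# or add up to 50 files from a list of paths/URLs,
# tokenizing each one as it is added
python -m CorrectOCR add --documents ./documents.txt --max_count 50 --prepare_step tokenize
```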
`prepare` tokenizes and prepares texts for correction.

- The `--fileid` option specifies which file to tokenize.
- The `--all` option tokenizes all available texts. Can be combined with `--exclude` to skip specific files.
- The `--step` option specifies how many of the processing steps to take. The default is to take all steps:
  - `tokenize` simply splits the text into tokens (words).
  - `align` aligns tokens with gold versions, if these exist.
  - `kbest` calculates k-best correction candidates for each token via the HMM.
  - `bin` sorts the tokens into bins according to the heuristics below.

Each of the steps includes the previous step, and intermediary information about each token is saved to CSV files or a database.
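Two hypothetical invocations (file IDs are placeholders):

```shell
# run all preparation steps on every available text, skipping one file
python -m CorrectOCR prepare --all --exclude 1855_Jensen

# run only the steps up to and including k-best calculation for one file
python -m CorrectOCR prepare --fileid 1855_Jensen --step kbest
```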
`stats` is used to configure which decisions the program should make about each bin of tokens:

- `--make_report` generates a statistical report on whether originals and k-best candidates are in the dictionary, etc. This report can then be inspected and annotated with the desired decision for each bin.
- `--make_settings` creates correction settings based on the annotated report.
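The two-stage workflow, sketched as hypothetical invocations:

```shell
# 1. generate the statistical report for inspection and annotation
python -m CorrectOCR stats --make_report

# 2. after annotating the report with a decision per bin,
#    turn it into correction settings
python -m CorrectOCR stats --make_settings
```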
`correct` uses the settings to sort the tokens into bins and makes automated decisions as configured.

- The `--fileid` option specifies which file to correct.

There are three ways to run corrections:

- `--interactive` runs an interactive correction CLI for the remaining undecided tokens (see Correction Interface below).
- `--apply` takes a path argument to an edited token CSV file and applies the corrections therein.
- `--autocorrect` applies available corrections as configured in the correction settings (i.e. any heuristic bins not marked for human annotation).
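Hypothetical invocations showing the three modes (file ID and CSV path are placeholders):

```shell
# apply the configured automatic corrections
python -m CorrectOCR correct --fileid 1855_Jensen --autocorrect

# apply corrections from an edited token CSV file
python -m CorrectOCR correct --fileid 1855_Jensen --apply ./edited_tokens.csv

# decide the remaining tokens interactively
python -m CorrectOCR correct --fileid 1855_Jensen --interactive
```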
`index` finds specified terms for use in index generation.

- The `--fileid` option specifies a single file for which to generate an index.
- The `--all` option generates indices for all available files. Can be combined with `--exclude` to skip specific files.
- The `--termFile` option specifies a text file containing one word per line, which will be matched against the tokens. The option may be repeated, and each filename (without extension) will be used as a marker for the matched string.
- The `--highlight` option creates a copy of the input files with highlighted words (only available for PDFs).
- The `--autocorrect` option applies available corrections prior to search/highlighting, as above.
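A hypothetical invocation (the term-file names are placeholders; their basenames would mark the matches as described above):

```shell
# index all files against two term lists, autocorrecting first,
# and produce highlighted copies of the PDFs
python -m CorrectOCR index --all --autocorrect \
    --termFile persons.txt --termFile places.txt --highlight
```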
`server` starts a simple Flask backend server that provides JSON descriptions and `.png` images of tokens, and accepts POST requests to update tokens with corrections.

`cleanup`
deletes the backup files in the training directory.

- The `--dryrun` option simply lists the files without actually deleting them.
- The `--full` option also deletes the current files (i.e. those without `.nnn.` in their suffix).
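A hypothetical invocation; running with `--dryrun` first is a safe way to preview what would be removed:

```shell
# list the backup files that would be deleted, without deleting anything
python -m CorrectOCR cleanup --dryrun
```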