CorrectOCR.commands module
Commands
Commands and their arguments are called directly on the module, like so:
python -m CorrectOCR [command] [args...]
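When several commands are chained from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch, not part of CorrectOCR: the helper name build_argv is hypothetical, and it assumes that options are written as --name followed by a value, with switches such as --all given as bare flags.

```python
import sys


def build_argv(command, **options):
    """Build an argv list for `python -m CorrectOCR <command> ...`.

    Hypothetical helper, not part of CorrectOCR. Keyword names are passed
    through verbatim as `--name`; True becomes a bare flag (an assumption),
    False/None is omitted, and any other value follows the flag as a string.
    """
    argv = [sys.executable, "-m", "CorrectOCR", command]
    for name, value in options.items():
        if value is None or value is False:
            continue
        argv.append(f"--{name}")
        if value is not True:
            argv.append(str(value))
    return argv


# Example: prepare one document up to the k-best step (the file id is made up).
argv = build_argv("prepare", fileid="example_doc", step="kbest")
print(" ".join(argv[1:]))
```

The resulting list can be handed to subprocess.run(argv, check=True) in an environment where CorrectOCR is installed.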
The following commands are available:
build_dictionary creates a dictionary. Input files can be .pdf, .txt, or .xml (in TEI format). They may be contained in .zip files.
The --corpusPath option specifies a directory of files.
The --corpusFile option specifies a file containing paths and URLs. One such file for a dictionary covering 1800–1948 Danish is provided under resources/.
The --clear option clears the dictionary before adding words (the file is backed up first).
It is strongly recommended to generate a large dictionary for best performance.
align aligns a pair of (original, gold) files in order to determine which characters and words were misread in the original and corrected in the gold.
The --fileid option specifies a single pair of files to align.
The --all option aligns all available pairs. Can be combined with --exclude to skip specific files.
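To illustrate what aligning an (original, gold) pair yields, here is a rough sketch using Python's standard difflib. It is not CorrectOCR's own alignment code; the function name misread_pairs is hypothetical, and difflib is swapped in only to show the idea of collecting the spans that differ.

```python
from collections import Counter
from difflib import SequenceMatcher


def misread_pairs(original, gold):
    """Count (original_chunk, gold_chunk) pairs where the two texts differ."""
    matcher = SequenceMatcher(a=original, b=gold, autojunk=False)
    pairs = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side means an insertion or a deletion.
            pairs[(original[i1:i2], gold[j1:j2])] += 1
    return pairs


# "h" was misread as "b" in the OCR output.
print(misread_pairs("Tbe cat", "The cat"))
```

Counts of this kind, accumulated over many file pairs, are the sort of information a model of character misreadings can be estimated from.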
build_model uses the alignments to create parameters for the HMM.
The --smoothingParameter option can be adjusted as needed.
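The documentation does not say which kind of smoothing is used. As one plausible reading, the sketch below assumes additive smoothing of character confusion counts when estimating HMM emission probabilities. The function name, the data layout, and the default value are all hypothetical.

```python
def smoothed_emissions(counts, alphabet, smoothing=0.0001):
    """Turn raw confusion counts into emission probabilities.

    counts[gold_char][observed_char] is how often gold_char was read as
    observed_char. Additive smoothing is an assumption of this sketch;
    smoothing must be > 0 so that unseen characters get a valid row.
    """
    probs = {}
    for gold_char in alphabet:
        row = counts.get(gold_char, {})
        total = sum(row.get(c, 0) for c in alphabet) + smoothing * len(alphabet)
        probs[gold_char] = {
            c: (row.get(c, 0) + smoothing) / total for c in alphabet
        }
    return probs


# "h" was read correctly 9 times and misread as "b" once.
probs = smoothed_emissions({"h": {"h": 9, "b": 1}}, alphabet="hb", smoothing=1.0)
print(probs["h"])
```

A larger smoothing value moves probability mass toward confusions that were never observed in the aligned files; a smaller one trusts the observed counts more.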
add copies or downloads files to the workspace. One may provide a single file directly, or use the --documents option to provide a list of files.
The --documents option specifies a file containing paths and URLs.
The --max_count option specifies the maximum number of files to add.
The --prepare_step option allows the automatic preparation of the files as they are added. See prepare below.
prepare tokenizes and prepares texts for correction.
The --fileid option specifies which file to tokenize.
The --all option tokenizes all available texts. Can be combined with --exclude to skip specific files.
The --step option specifies how many of the processing steps to take. The default is to take all steps. The steps, in order, are:
tokenize simply splits the text into tokens (words).
align aligns tokens with gold versions, if these exist.
kbest calculates k-best correction candidates for each token via the HMM.
bin sorts the tokens into bins according to the heuristics below.
Each step includes the previous steps, and will save intermediary information about each token to CSV or a database.
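The cumulative behaviour of the steps can be pictured as follows. This is only an illustration of the ordering described above, not CorrectOCR's implementation; the names STEPS and steps_to_run are hypothetical.

```python
# Step names in the order given in the documentation.
STEPS = ["tokenize", "align", "kbest", "bin"]


def steps_to_run(step=None):
    """Return every step up to and including `step` (all steps by default)."""
    if step is None:
        return list(STEPS)
    if step not in STEPS:
        raise ValueError(f"unknown step: {step!r}")
    return STEPS[: STEPS.index(step) + 1]


print(steps_to_run("kbest"))
```

Requesting kbest therefore also implies tokenize and align, while omitting the step runs everything through bin.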
stats is used to configure which decisions the program should make about each bin of tokens:
--make_report generates a statistical report on whether originals and k-best candidates are equal, are in the dictionary, etc. This report can then be inspected and annotated with the desired decision for each bin.
--make_settings creates correction settings based on the annotated report.
correct uses the settings to sort the tokens into bins and makes automated decisions as configured.
The --fileid option specifies which file to correct.
There are three ways to run corrections:
--interactive runs an interactive correction CLI for the remaining undecided tokens (see Correction Interface below).
--apply takes a path argument to an edited token CSV file and applies the corrections therein.
--autocorrect applies available corrections as configured in correction settings (i.e. any heuristic bins not marked for human annotation).
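The idea behind --autocorrect can be sketched as a lookup from bin to decision. Everything in this example is hypothetical: the Token fields, the bin numbers, and the decision labels ("original", "kbest", "annotator") are stand-ins chosen for illustration and do not reflect CorrectOCR's actual settings format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    original: str           # text as read by OCR
    best_candidate: str     # top k-best suggestion
    bin: int                # heuristic bin the token was sorted into
    corrected: Optional[str] = None


def autocorrect(tokens, settings):
    """Apply the configured decision per bin; return tokens left for a human."""
    undecided = []
    for token in tokens:
        # Bins without a configured decision are left for annotation.
        decision = settings.get(token.bin, "annotator")
        if decision == "original":
            token.corrected = token.original
        elif decision == "kbest":
            token.corrected = token.best_candidate
        else:
            undecided.append(token)
    return undecided


settings = {1: "original", 2: "kbest", 3: "annotator"}  # made-up bins
tokens = [Token("cat", "cat", 1), Token("Tbe", "The", 2), Token("qxz", "quiz", 3)]
remaining = autocorrect(tokens, settings)
print([t.corrected for t in tokens], len(remaining))
```

Tokens returned as undecided are the ones an interactive session or an edited token file would then need to resolve.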
index finds specified terms for use in index generation.
The --fileid option specifies a single file for which to generate an index.
The --all option generates indices for all available files. Can be combined with --exclude to skip specific files.
The --termFile option specifies a text file containing a word on each line, which will be matched against the tokens. The option may be repeated, and each file's name (without extension) will be used as the marker for the terms it contains.
The --highlight option will create a copy of the input files with highlighted words (only available for PDFs).
The --autocorrect option applies available corrections prior to search/highlighting, as above.
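The term file mechanism can be pictured with a small sketch. It is not CorrectOCR's code: the function names are hypothetical, and case-insensitive exact matching is an assumption made for the example.

```python
from pathlib import Path


def load_terms(term_files):
    """Map each term to a marker taken from its file's name without extension."""
    terms = {}
    for path in map(Path, term_files):
        marker = path.stem
        for line in path.read_text(encoding="utf-8").splitlines():
            word = line.strip()
            if word:
                # Lower-casing is an assumption of this sketch.
                terms[word.lower()] = marker
    return terms


def index_tokens(tokens, terms):
    """Return (position, token, marker) for every token that matches a term."""
    return [
        (position, token, terms[token.lower()])
        for position, token in enumerate(tokens)
        if token.lower() in terms
    ]
```

With a file places.txt containing the line Copenhagen, index_tokens(["In", "Copenhagen"], load_terms(["places.txt"])) would report the second token with the marker places.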
server starts a simple Flask backend server that provides JSON descriptions and .png images of tokens, and accepts POST requests to update tokens with corrections.
cleanup deletes the backup files in the training directory.
The --dryrun option simply lists the files without actually deleting them.
The --full option also deletes the current files (i.e. those without .nnn. in their suffix).
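The selection logic of cleanup might look roughly like the sketch below. It assumes that .nnn. stands for three digits between dots (for example model.001.json); that pattern, the function name, and the parameter names are assumptions rather than CorrectOCR's actual behaviour.

```python
import re
from pathlib import Path

# Assumed backup marker: three digits between dots, e.g. "model.001.json".
BACKUP_PATTERN = re.compile(r"\.\d{3}\.")


def cleanup(directory, dryrun=False, full=False):
    """Delete backup files (and, with full=True, current files too).

    Returns the selected paths; with dryrun=True nothing is deleted.
    """
    selected = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        is_backup = BACKUP_PATTERN.search(path.name) is not None
        if is_backup or full:
            selected.append(path)
            if not dryrun:
                path.unlink()
    return selected
```

Running it first with dryrun=True gives a chance to review the list before anything is removed.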