Commands¶
Global Arguments¶
If the global arguments are not provided on the command line, CorrectOCR.ini and environment variables are checked (see CorrectOCR.config).
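As a sketch, a minimal CorrectOCR.ini might pin down the workspace paths and language so they need not be repeated on every invocation. The section and key layout below is an assumption for illustration; the authoritative format is defined by CorrectOCR.config.

```ini
; Hypothetical CorrectOCR.ini sketch. Keys mirror the command-line
; arguments documented below; the exact section layout is defined
; by CorrectOCR.config.
[workspace]
rootPath = ./workspace
language = Danish
nheaderlines = 0

[resources]
resourceRootPath = ./workspace/resources
```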
Workspace¶
These arguments configure the Workspace, i.e. where the documents are located.
usage: python -m CorrectOCR [-h] [--rootPath PATH] [--originalPath PATH]
[--goldPath PATH] [--trainingPath PATH]
[--docInfoBaseURL URL] [--nheaderlines N]
[--language LANGUAGE]
[--combine_hyphenated_images [COMBINE_HYPHENATED_IMAGES]]
Named Arguments¶
- --rootPath
Path to root of workspace
- --originalPath
Path to directory of original, uncorrected docs
- --goldPath
Path to directory of known correct “gold” docs
- --trainingPath
Path for generated training files
- --docInfoBaseURL
Base URL that serves info about documents
- --nheaderlines
Number of lines in corpus headers
Default: 0
- --language
Language of text
- --combine_hyphenated_images
Generate joined images for hyphenated tokens
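Putting the workspace arguments together, an invocation might look like the following. The paths and document language are illustrative, not defaults.

```shell
# Illustrative only: point CorrectOCR at a workspace root, declare
# the document language, and then run a command (here: stats).
python -m CorrectOCR \
    --rootPath ./workspace \
    --originalPath ./workspace/original \
    --goldPath ./workspace/gold \
    --language Danish \
    stats --make_report
```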
Resource¶
These arguments configure the ResourceManager, e.g. dictionary, model, etc.
usage: python -m CorrectOCR [-h] [--resourceRootPath PATH]
[--hmmParamsFile FILE] [--reportFile FILE]
[--heuristicSettingsFile FILE]
[--multiCharacterErrorFile FILE]
[--memoizedCorrectionsFile FILE]
[--correctionTrackingFile FILE]
[--dictionaryFile FILE] [--ignoreCase]
Named Arguments¶
- --resourceRootPath
Path to root of resources
- --hmmParamsFile
Path to HMM parameters (generated from alignment docs via build_model command)
- --reportFile
Path to output heuristics report (TXT file)
- --heuristicSettingsFile
Path to heuristics settings (generated via make_settings command)
- --multiCharacterErrorFile
Path to output multi-character error file (JSON format)
- --memoizedCorrectionsFile
Path to memoizations of corrections.
- --correctionTrackingFile
Path to correction tracking.
- --dictionaryFile
Path to dictionary file
- --ignoreCase
Use case insensitive dictionary comparisons
Default: False
Storage¶
These arguments configure the TokenList
backend storage.
usage: python -m CorrectOCR [-h] [--type {db,fs}] [--db_driver DB_DRIVER]
[--db_host DB_HOST] [--db_user DB_USER]
[--db_password DB_PASSWORD] [--db DB]
Named Arguments¶
- --type
Possible choices: db, fs
Storage type
- --db_driver
Database driver
- --db_host
Database hostname
- --db_user
Database username
- --db_password
Database user password
- --db
Database name
Commands¶
Correct OCR
usage: python -m CorrectOCR [-h] [-k K] [--force]
[--loglevel {CRITICAL,FATAL,ERROR,WARNING,INFO,DEBUG}]
[--dehyphenate DEHYPHENATE]
{dictionary,align,model,add,prepare,crop,stats,correct,index,cleanup,server}
...
Positional Arguments¶
- command
Possible choices: dictionary, align, model, add, prepare, crop, stats, correct, index, cleanup, server
Choose command
Named Arguments¶
- -k
Number of k-best candidates to use for tokens (default: 4)
Default: 4
- --force
Force command to run
Default: False
- --loglevel
Possible choices: CRITICAL, FATAL, ERROR, WARNING, INFO, DEBUG
Log level
Default: “INFO”
- --dehyphenate
Automatically mark new tokens as hyphenated if they end with a dash
Sub-commands:¶
dictionary¶
Dictionary-related commands.
python -m CorrectOCR dictionary [-h] {build,check} ...
Sub-commands:¶
build¶
Build dictionary.
Input files can be either .txt or .xml (in TEI format). They may be contained in .zip files.
A corpusFile for 1800–1948 Danish is available in the workspace/resources/ directory.
It is strongly recommended to generate a large dictionary for best performance.
See CorrectOCR.dictionary for further details.
python -m CorrectOCR dictionary build [-h] [--corpusPath CORPUSPATH]
[--corpusFile CORPUSFILE]
[--add_annotator_gold] [--clear]
- --corpusPath
Directory of files to split into words and add to dictionary
- --corpusFile
File containing paths and URLs to use as corpus (TXT format)
- --add_annotator_gold
Add gold words from annotated tokens
Default: False
- --clear
Clear the dictionary before adding words
Default: False
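A build invocation might look like the following; the corpus directory is illustrative.

```shell
# Illustrative: rebuild the dictionary from scratch out of a local
# directory of text files, clearing any previous contents first.
python -m CorrectOCR dictionary build --corpusPath ./corpus --clear
```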
check¶
Undocumented
python -m CorrectOCR dictionary check [-h] [words [words ...]]
- words
Words to check in dictionary
align¶
Create alignments.
The tokens of each pair of (original, gold) files are aligned in order to determine which characters and words were misread in the original and corrected in the gold.
These alignments can be used to train the model.
See CorrectOCR.aligner for further details.
python -m CorrectOCR align [-h]
(--docid DOCID | --docids DOCIDS [DOCIDS ...] | --all)
[--exclude EXCLUDE]
Named Arguments¶
- --docid
Input document ID (filename without path or extension)
- --docids
Input multiple document IDs
- --all
Align all original/gold pairs
Default: False
- --exclude
Doc ID to exclude (can be specified multiple times)
Default: []
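For example, one might align a single document, or everything except a known-bad pair. The document IDs here are hypothetical.

```shell
# Illustrative: align one original/gold pair by its document ID...
python -m CorrectOCR align --docid doc001
# ...or align every pair except one.
python -m CorrectOCR align --all --exclude doc013
```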
model¶
Build model. This is done with the aligned original/gold documents. If none exist, an attempt will be made to create them.
The result is an HMM as described in the original paper.
See CorrectOCR.model for further details.
python -m CorrectOCR model [-h] (--build | --get_kbest GET_KBEST)
[--smoothingParameter N[.N]] [--other OTHER]
Named Arguments¶
- --build
Rebuild model
Default: False
- --get_kbest
Get k-best for word with current model
- --smoothingParameter
Smoothing parameters for HMM
Default: 0.0001
- --other
Compare against other model
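A typical sequence is to rebuild the model and then spot-check its candidates for a garbled word. The sample word below is hypothetical.

```shell
# Illustrative: rebuild the HMM with the default smoothing, then
# ask for the k-best correction candidates for a misread word.
python -m CorrectOCR model --build --smoothingParameter 0.0001
python -m CorrectOCR model --get_kbest exarnple
```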
add¶
Add documents for processing.
One may add a single document directly on the command line, or provide a text file containing a list of documents.
They will be copied or downloaded to the workspace/original/ folder.
See CorrectOCR.workspace.Workspace for further details.
python -m CorrectOCR add [-h] [--documentsFile DOCUMENTSFILE]
[--prepare_step {tokenize,align,kbest,bin,all,server}]
[--max_count MAX_COUNT]
[document]
Positional Arguments¶
- document
Single path/URL to document
Named Arguments¶
- --documentsFile
File containing list of files/URLS to documents
- --prepare_step
Possible choices: tokenize, align, kbest, bin, all, server
Automatically prepare added documents
- --max_count
Maximum number of new documents to add from --documentsFile.
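For instance, one might add a single scan and tokenize it immediately, or batch-add from a list. The file names are illustrative.

```shell
# Illustrative: add one PDF and run the tokenize step right away...
python -m CorrectOCR add --prepare_step tokenize ./scans/doc001.pdf
# ...or add at most 10 documents listed (as paths/URLs) in a text file.
python -m CorrectOCR add --documentsFile docs.txt --max_count 10
```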
prepare¶
Prepare text for correction.
See CorrectOCR.workspace.Document for further details on the possible steps.
python -m CorrectOCR prepare [-h]
(--docid DOCID | --docids DOCIDS [DOCIDS ...] | --all | --skip_done)
[--exclude EXCLUDE]
[--step {tokenize,rehyphenate,align,kbest,bin,all,server}]
[--autocrop] [--precache_images]
Named Arguments¶
- --docid
Input document ID (filename without path or extension)
- --docids
Input multiple document IDs
- --all
Select all documents
Default: False
- --skip_done
Select only unfinished documents
Default: False
- --exclude
Doc ID to exclude (can be specified multiple times)
Default: []
- --step
Possible choices: tokenize, rehyphenate, align, kbest, bin, all, server
Default: “all”
- --autocrop
Discard tokens near page edges
Default: False
- --precache_images
Create images for the server API
Default: False
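A common pattern is to prepare only the documents that are not yet finished, running every step and discarding edge tokens in the same pass:

```shell
# Illustrative: run all preparation steps on unfinished documents,
# discarding tokens near the page edges as they are tokenized.
python -m CorrectOCR prepare --skip_done --step all --autocrop
```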
crop¶
Mark tokens near the edges of a page as disabled.
This may be desirable for scanned documents where the OCR has picked up partial words or sentences near the page edges.
The tokens are not discarded, merely marked disabled so they don’t show up in the correction interface or generated gold files.
If neither --edge_left nor --edge_right is provided, an attempt will be made to calculate them automatically.
python -m CorrectOCR crop [-h]
(--docid DOCID | --docids DOCIDS [DOCIDS ...] | --all)
[--edge_left EDGE_LEFT] [--edge_right EDGE_RIGHT]
Named Arguments¶
- --docid
Input document ID (filename without path or extension)
- --docids
Input multiple document IDs
- --all
Crop all documents
Default: False
- --edge_left
Set left cropping edge (in pixels)
- --edge_right
Set right cropping edge (in pixels)
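For example, explicit crop edges can be set for a single document; omitting both edge arguments triggers the automatic calculation described above. The pixel values and document ID are illustrative.

```shell
# Illustrative: disable tokens left of x=50 and right of x=1200
# for one document (values in pixels).
python -m CorrectOCR crop --docid doc001 --edge_left 50 --edge_right 1200
```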
stats¶
Calculate stats about corrected documents.
The procedure is to first generate a report that shows how many tokens have been sorted into each bin. This report can then be annotated with the desired decision for each bin, and the annotated report used to generate settings for the heuristics.
See CorrectOCR.heuristics.Heuristics for further details.
python -m CorrectOCR stats [-h] (--make_report | --make_settings) [--rebin]
[--only_done ONLY_DONE]
Named Arguments¶
- --make_report
Make heuristics statistics report from finished documents
Default: False
- --make_settings
Make heuristics settings from report
Default: False
- --rebin
Rerun kbest/bin steps to compare quality (will take longer)
Default: False
- --only_done
Whether to include all or only fully annotated documents
Default: True
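The two-step workflow described above might be run as follows; the report and settings file locations are those configured via --reportFile and --heuristicSettingsFile.

```shell
# Illustrative: generate the bin report from finished documents...
python -m CorrectOCR stats --make_report
# ...annotate the report file by hand with a decision per bin, then
# turn the annotated report into heuristic settings.
python -m CorrectOCR stats --make_settings
```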
correct¶
Apply corrections
python -m CorrectOCR correct [-h] (--docid DOCID | --filePath FILEPATH)
(--interactive | --apply APPLY | --autocorrect | --gold_ready)
[--highlight]
Named Arguments¶
- --docid
Input document ID (filename without path or extension)
- --filePath
Input file path (will be copied to originalPath directory)
- --interactive
Use interactive shell to input and approve suggested corrections
Default: False
- --apply
Apply externally corrected token CSV to original document
- --autocorrect
Apply automatic corrections as configured in settings
Default: False
- --gold_ready
Apply gold from ready document
Default: False
- --highlight
Create a copy with highlighted words (only available for PDFs)
Default: False
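As a sketch, a document might be corrected interactively in the shell, or non-interactively using the configured heuristics. The document ID is hypothetical.

```shell
# Illustrative: step through suggested corrections interactively...
python -m CorrectOCR correct --docid doc001 --interactive
# ...or apply the configured automatic corrections and produce a
# highlighted copy (PDFs only).
python -m CorrectOCR correct --docid doc001 --autocorrect --highlight
```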
index¶
Generate index data
python -m CorrectOCR index [-h] (--docid DOCID | --filePath FILEPATH)
[--exclude EXCLUDE] --termFile TERMFILES
[--highlight] [--autocorrect]
Named Arguments¶
- --docid
Input document ID (filename without path or extension)
- --filePath
Input file path (will be copied to originalPath directory)
- --exclude
Doc ID to exclude (can be specified multiple times)
Default: []
- --termFile
File containing a string on each line, which will be matched against the tokens
Default: []
- --highlight
Create a copy with highlighted words (only available for PDFs)
Default: False
- --autocorrect
Apply automatic corrections as configured in settings
Default: False
cleanup¶
Clean up intermediate files
python -m CorrectOCR cleanup [-h] [--dryrun] [--full]
Named Arguments¶
- --dryrun
Don't delete files, just list them
Default: False
- --full
Also delete the most recent files (without .nnn. in suffix)
Default: False
server¶
Run basic JSON-dispensing Flask server
python -m CorrectOCR server [-h] [--host HOST] [--debug] [--profile [PROFILE]]
[--dynamic_images DYNAMIC_IMAGES]
[--redirect_hyphenated REDIRECT_HYPHENATED]
Named Arguments¶
- --host
The host address
- --debug
Runs the server in debug mode (see Flask docs)
Default: False
- --profile
Use Werkzeug profiler middleware
- --dynamic_images
Should images be generated dynamically?
- --redirect_hyphenated
Redirect requests for hyphenated tokens to “head”
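For local testing, the server might be started on all interfaces with debug output enabled; the host address is illustrative.

```shell
# Illustrative: serve the JSON API on all interfaces in debug mode
# (do not use Flask debug mode in production).
python -m CorrectOCR server --host 0.0.0.0 --debug
```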