Configuration
When invoked, CorrectOCR looks for a file named CorrectOCR.ini in
the working directory. If found, it is loaded, and its entries are
used as the defaults for the corresponding options. For example:
[configuration]
characterSet = ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
[workspace]
correctedPath = corrected/
goldPath = gold/
originalPath = original/
trainingPath = training/
nheaderlines = 0
[resources]
correctionTrackingFile = resources/correction_tracking.json
dictionaryFile = resources/dictionary.txt
hmmParamsFile = resources/hmm_parameters.json
memoizedCorrectionsFile = resources/memoized_corrections.json
multiCharacterErrorFile = resources/multicharacter_errors.json
reportFile = resources/report.txt
heuristicSettingsFile = resources/settings.json
[storage]
type = fs
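The way ini entries can act as defaults for command-line options can be sketched with the Python standard library alone. This is a minimal illustration and not CorrectOCR's actual implementation: the helper `load_defaults` is hypothetical, and only the `goldPath` and `originalPath` keys from the example above are used.

```python
import argparse
import configparser


def load_defaults(ini_text: str) -> dict:
    """Flatten all sections of an ini document into one dict of defaults.

    Hypothetical helper for illustration only.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep camelCase keys such as goldPath
    parser.read_string(ini_text)
    defaults = {}
    for section in parser.sections():
        defaults.update(parser[section])
    return defaults


INI = """
[workspace]
goldPath = gold/
originalPath = original/
"""

defaults = load_defaults(INI)

cli = argparse.ArgumentParser()
cli.add_argument("--goldPath")
cli.add_argument("--originalPath")
cli.set_defaults(**defaults)  # ini values become the option defaults

# An explicit argument overrides the ini default; the rest fall back to it.
args = cli.parse_args(["--goldPath", "other_gold/"])
```

In this sketch, `args.goldPath` takes the value given on the command line, while `args.originalPath` keeps the value read from the ini text.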
By default, CorrectOCR requires 4 subdirectories in the working
directory, which will be used as the current Workspace:
- original/ contains the original uncorrected files. If necessary, it can be configured with the --originalPath argument.
- gold/ contains the known correct “gold” files. If necessary, it can be configured with the --goldPath argument.
- training/ contains the various generated files used during training. If necessary, it can be configured with the --trainingPath argument.
- corrected/ contains the corrected files generated by running the correct command. If necessary, it can be configured with the --correctedPath argument.
Corresponding files in original, gold, and corrected are named
identically, and the filename without extension is considered the file
ID. The generated files in training/ have suffixes according to
their kind.
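The file ID convention can be illustrated as follows. This is a sketch under assumptions: the helpers `file_id` and `group_by_id` and the example filenames are hypothetical, and `Path.stem` removes only the last extension, which may differ from how CorrectOCR derives IDs for names with several dots.

```python
from pathlib import Path


def file_id(path) -> str:
    """Return the filename without its extension (hypothetical helper)."""
    return Path(path).stem


def group_by_id(paths) -> dict:
    """Map each file ID to the paths that share it (hypothetical helper)."""
    groups = {}
    for p in paths:
        groups.setdefault(file_id(p), []).append(str(p))
    return groups


# Made-up filenames: identically named files in different directories share an ID.
paths = [
    "original/7696.txt",
    "gold/7696.txt",
    "corrected/7696.txt",
    "original/7697.txt",
]
groups = group_by_id(paths)
```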
If generated files exist, CorrectOCR will generally avoid doing
redundant calculations. The --force switch overrides this, forcing
CorrectOCR to create new files (after moving the existing ones out of
the way). Alternatively, one may delete a subset of the generated files so
that only those are recreated.
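The skip-unless-forced behaviour described above can be sketched roughly like this. It is an assumption-laden illustration, not CorrectOCR's code: the function `should_generate` is hypothetical, and the `.bak` backup name is invented, since the text does not say where existing files are moved.

```python
import tempfile
from pathlib import Path


def should_generate(path: Path, force: bool = False) -> bool:
    """Decide whether `path` needs to be (re)generated.

    With force=True, an existing file is first moved out of the way.
    Hypothetical helper; the '.bak' naming is an assumption.
    """
    if not path.exists():
        return True
    if not force:
        return False  # reuse the existing file, avoid redundant work
    backup = path.with_name(path.name + ".bak")
    path.replace(backup)  # move the old file aside
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.json"
    target.write_text("{}")
    skipped = not should_generate(target)          # file exists, no force
    forced = should_generate(target, force=True)   # existing file moved aside
    backup_exists = (Path(tmp) / "example.json.bak").exists()
```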
The Workspace also has a ResourceManager (accessible in code via
.resources) that handles access to the dictionary, HMM parameter
files, etc.