Configuration¶
When invoked, CorrectOCR looks for a file named CorrectOCR.ini
in
the working directory. If found, it is loaded, and any entries will be
considered defaults to their corresponding option. For example:
[configuration]
characterSet = ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
[workspace]
correctedPath = corrected/
goldPath = gold/
originalPath = original/
trainingPath = training/
nheaderlines = 0
[resources]
correctionTrackingFile = resources/correction_tracking.json
dictionaryFile = resources/dictionary.txt
hmmParamsFile = resources/hmm_parameters.json
memoizedCorrectionsFile = resources/memoized_corrections.json
multiCharacterErrorFile = resources/multicharacter_errors.json
reportFile = resources/report.txt
heuristicSettingsFile = resources/settings.json
[storage]
type = fs
By default, CorrectOCR requires 4 subdirectories in the working
directory, which will be used as the current Workspace
:
original/
contains the original uncorrected files. If necessary, it can be configured with the--originalPath
argument.gold/
contains the known correct “gold” files. If necessary, it can be configured with the--goldPath
argument.training/
contains the various generated files used during training. If necessary, it can be configured with the--trainingPath
argument.corrected/
contains the corrected files generated by running thecorrect
command. If necessary, it can be configured with the--correctedPath
argument.
Corresponding files in original, gold, and corrected are named
identically, and the filename without extension is considered the file
ID. The generated files in training/
have suffixes according to
their kind.
If generated files exist, CorrectOCR will generally avoid doing
redundant calculations. The --force
switch overrides this, forcing
CorrectOCR to create new files (after moving the existing ones out of
the way). Alternately, one may delete a subset of the generated files to
only recreate those.
The Workspace
also has a ResourceManager
(accessible in code via
.resources
) that handles access to the dictionary, HMM parameter
files, etc.