CorrectOCR.workspace module

class CorrectOCR.workspace.Workspace(workspaceconfig, resourceconfig, storageconfig)
Bases: object

The Workspace holds references to Documents and resources used by the various commands.

Parameters
- workspaceconfig – An object with the following properties:
  - nheaderlines (int): The number of header lines in corpus texts.
  - language: A language instance from pycountry <https://pypi.org/project/pycountry/>.
  - originalPath (Path): Directory containing the original docs.
  - goldPath (Path): Directory containing the gold (if any) docs.
  - trainingPath (Path): Directory for storing intermediate docs.
  - correctedPath (Path): Directory for saving corrected docs.
- resourceconfig – Passed directly to ResourceManager, see this for further info.
- storageconfig – TODO
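
As a rough illustration of the shape described above, the following sketch builds a stand-in for workspaceconfig with types.SimpleNamespace. Only the attribute names come from the parameter list; the helper name make_workspaceconfig, the root directory, and the subdirectory names are hypothetical, and language is left as a placeholder where a pycountry language object would normally go.

```python
from pathlib import Path
from types import SimpleNamespace


def make_workspaceconfig(root, nheaderlines=0, language=None):
    """Build a stand-in config object with the documented properties.

    `root` and the subdirectory names are hypothetical; only the
    attribute names are taken from the parameter list.
    """
    root = Path(root)
    return SimpleNamespace(
        nheaderlines=nheaderlines,
        language=language,  # assumption: a pycountry language object in real use
        originalPath=root / "original",
        goldPath=root / "gold",
        trainingPath=root / "training",
        correctedPath=root / "corrected",
    )


config = make_workspaceconfig("workspace", nheaderlines=2)
print(config.originalPath, config.nheaderlines)
```

Any object exposing these attributes would fit the description; a namespace is used here only because it is the shortest way to show the expected fields.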

add_doc(doc)
Initializes a new Document and adds it to the workspace.

The doc_id of the document will be determined by its filename.
If the file is not in the originalPath, it will be copied or downloaded there.
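
The staging behaviour described for add_doc could look roughly like the sketch below. It is not the library's implementation: stage_document is a hypothetical helper, using the filename stem as the doc_id is an assumption, and only the local copy case is covered (downloading is omitted).

```python
import shutil
import tempfile
from pathlib import Path


def stage_document(doc, original_path):
    """Copy `doc` into `original_path` unless it is already there.

    Returns a (doc_id, staged_path) pair. Using the filename stem as the
    doc_id is an assumption made for this sketch.
    """
    doc = Path(doc)
    original_path = Path(original_path)
    original_path.mkdir(parents=True, exist_ok=True)

    target = original_path / doc.name
    if doc.resolve().parent != original_path.resolve():
        shutil.copy(doc, target)  # local files only; downloads are not handled
    return doc.stem, target


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    incoming = Path(tmp) / "incoming"
    incoming.mkdir()
    source = incoming / "letter.txt"
    source.write_text("Dear reader\n", encoding="utf-8")

    doc_id, staged = stage_document(source, Path(tmp) / "original")
    print(doc_id, staged.is_file())
```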

class CorrectOCR.workspace.Document(workspace, doc, original, gold, training, corrected, nheaderlines=0)
Bases: object

Parameters
- doc (Path) – A path to a file.
- original (Path) – Directory for original uncorrected files.
- gold (Path) – Directory for known correct “gold” files (if any).
- training (Path) – Directory for storing intermediate files.
- corrected (Path) – Directory for saving corrected files.
- nheaderlines (int) – Number of lines in file header (only relevant for .txt files).

tokenFile
Path to token file (CSV format).

fullAlignmentsFile
Path to full letter-by-letter alignments (JSON format).

wordAlignmentsFile
Path to word-by-word alignments (JSON format).

readCountsFile
Path to letter read counts (JSON format).

alignments(force=False)
Uses the Aligner to generate alignments for a given original, gold pair of docs.

Caches its results in the trainingPath.
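
The combination of a force flag and on-disk caching is a common compute-or-load pattern. The sketch below shows one generic way to write it with JSON files; cached_json, the cache file name, and the fake_align callback are hypothetical and stand in for whatever the Aligner actually produces.

```python
import json
import tempfile
from pathlib import Path


def cached_json(cache_file, compute, force=False):
    """Return cached JSON data, recomputing when missing or when `force` is set."""
    cache_file = Path(cache_file)
    if cache_file.is_file() and not force:
        with cache_file.open(encoding="utf-8") as f:
            return json.load(f)

    result = compute()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("w", encoding="utf-8") as f:
        json.dump(result, f)
    return result


# Demonstration: the expensive step runs only when needed.
with tempfile.TemporaryDirectory() as tmp:
    calls = []

    def fake_align():
        # Hypothetical stand-in for an alignment computation.
        calls.append(1)
        return {"a": {"a": 10, "o": 1}}

    cache = Path(tmp) / "training" / "example.alignments.json"
    cached_json(cache, fake_align)              # computes and writes the cache
    cached_json(cache, fake_align)              # served from the cache
    cached_json(cache, fake_align, force=True)  # recomputed on request
    print(len(calls))
```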

class CorrectOCR.workspace.CorpusFile(path, nheaderlines=0)
Bases: object

Simple wrapper for text files to manage a number of lines as a separate header.

is_file()
Return type: bool
Returns: Does the file exist? See pathlib.Path.is_file().

property id
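
To make the idea behind CorpusFile concrete, the following sketch splits a text file into a header of nheaderlines lines and a body holding the rest. HeaderedTextFile and its load method are hypothetical names chosen for illustration; the real class may store and expose its contents differently.

```python
import tempfile
from pathlib import Path


class HeaderedTextFile:
    """Hypothetical stand-in: keeps the first lines of a text file as a header."""

    def __init__(self, path, nheaderlines=0):
        self.path = Path(path)
        self.nheaderlines = nheaderlines
        self.header = ""
        self.body = ""

    def is_file(self):
        # Delegates to pathlib, as the documented is_file() does.
        return self.path.is_file()

    def load(self):
        # keepends=True preserves line endings so header + body == file contents.
        lines = self.path.read_text(encoding="utf-8").splitlines(keepends=True)
        self.header = "".join(lines[: self.nheaderlines])
        self.body = "".join(lines[self.nheaderlines:])
        return self


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Title\nAuthor\nFirst line\nSecond line\n", encoding="utf-8")
    doc = HeaderedTextFile(sample, nheaderlines=2).load()
    print(repr(doc.header))
    print(repr(doc.body))
```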

class CorrectOCR.workspace.JSONResource(path, **kwargs)
Bases: dict

Simple wrapper for JSON files.

Parameters
- path – Path to load from.
- kwargs – TODO
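
A dict subclass backed by a JSON file can be sketched as follows. JSONDict and its save method are hypothetical. Treating kwargs as initial entries is purely an assumption, since the documentation leaves that parameter undescribed, and the sketch also assumes the file's top-level JSON value is an object.

```python
import json
import tempfile
from pathlib import Path


class JSONDict(dict):
    """Hypothetical dict subclass that loads from and saves to a JSON file."""

    def __init__(self, path, **kwargs):
        super().__init__(**kwargs)  # assumption: kwargs seed initial entries
        self.path = Path(path)
        if self.path.is_file():
            with self.path.open(encoding="utf-8") as f:
                self.update(json.load(f))  # assumes a top-level JSON object

    def save(self):
        with self.path.open("w", encoding="utf-8") as f:
            json.dump(self, f, ensure_ascii=False, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    counts_path = Path(tmp) / "counts.json"
    counts = JSONDict(counts_path, a=1)
    counts["b"] = 2
    counts.save()

    reloaded = JSONDict(counts_path)
    print(dict(reloaded))
```

Subclassing dict keeps every ordinary mapping operation available, so the only extra surface is the path and the load and save steps.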