CorrectOCR.workspace module
- class CorrectOCR.workspace.Workspace(workspaceconfig, resourceconfig, storageconfig)
Bases: object
The Workspace holds references to Documents and resources used by the various commands.
- Parameters
workspaceconfig –
An object with the following properties:
nheaderlines (int): The number of header lines in corpus texts.
language: A language instance from pycountry <https://pypi.org/project/pycountry/>.
originalPath (Path): Directory containing the original docs.
goldPath (Path): Directory containing the gold (if any) docs.
trainingPath (Path): Directory for storing intermediate docs.
docInfoBaseURL (str): Base URL that when appended with a doc_id provides information about documents.
resourceconfig – Passed directly to ResourceManager, see this for further info.
storageconfig – TODO
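As a rough illustration of the workspaceconfig properties listed above, the following sketch builds a plain object carrying those attribute names. Only the attribute names come from the parameter list; the values, the use of `SimpleNamespace`, and the helper `doc_info_url` are assumptions made for the example and are not part of CorrectOCR.

```python
from pathlib import Path
from types import SimpleNamespace

# Hypothetical configuration object: only the attribute names are taken
# from the parameter list, every value here is a placeholder.
workspaceconfig = SimpleNamespace(
    nheaderlines=0,
    language=None,  # placeholder; in practice a pycountry language instance
    originalPath=Path("corpus/original"),
    goldPath=Path("corpus/gold"),
    trainingPath=Path("corpus/training"),
    docInfoBaseURL="https://example.org/docs/",
)


def doc_info_url(config, doc_id):
    """Hypothetical helper: append a doc_id to the configured base URL."""
    return config.docInfoBaseURL + doc_id


print(doc_info_url(workspaceconfig, "1234"))
```

Any object exposing these attributes should serve the same purpose, since the description only asks for "an object with the following properties".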
- add_doc(doc)
Initializes a new Document and adds it to the workspace.
The doc_id of the document will be determined by its filename.
If the file is not in the originalPath, it will be copied or downloaded there.
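The behaviour described for add_doc can be sketched roughly as follows. This is not the CorrectOCR implementation: the helper name `place_in_original` is hypothetical, the doc_id is assumed to be the filename without its extension, and only the local copy case is shown (downloading is left out).

```python
import shutil
import tempfile
from pathlib import Path


def place_in_original(doc, original_path):
    """Hypothetical helper: return (doc_id, path of the file in original_path)."""
    doc = Path(doc)
    original_path = Path(original_path)
    doc_id = doc.stem  # assumption: the id is the filename without extension
    target = original_path / doc.name
    # Copy only when the file does not already live in original_path.
    if doc.resolve().parent != original_path.resolve():
        original_path.mkdir(parents=True, exist_ok=True)
        shutil.copy(doc, target)
    return doc_id, target


# Usage with throwaway directories.
with tempfile.TemporaryDirectory() as tmp:
    incoming = Path(tmp) / "incoming" / "report_0001.txt"
    incoming.parent.mkdir()
    incoming.write_text("sample text", encoding="utf-8")
    doc_id, stored = place_in_original(incoming, Path(tmp) / "original")
    copied_ok = stored.is_file()
```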
- class CorrectOCR.workspace.CorpusFile(path, nheaderlines=0)
Bases: object
Simple wrapper for text files to manage a number of lines as a separate header.
- Parameters
- is_file()
- Return type
- Returns
Does the file exist? See pathlib.Path.is_file().
- property id
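To make the idea of "a number of lines as a separate header" concrete, here is a minimal sketch of such a wrapper. The class `HeaderedTextFile` is a hypothetical stand-in rather than the real CorpusFile; the `header` and `body` attributes and the choice of the filename stem as `id` are assumptions.

```python
import tempfile
from pathlib import Path


class HeaderedTextFile:
    """Hypothetical stand-in for a text file with a fixed-size header."""

    def __init__(self, path, nheaderlines=0):
        self.path = Path(path)
        self.nheaderlines = nheaderlines
        self.header = ""
        self.body = ""
        if self.is_file():
            text = self.path.read_text(encoding="utf-8")
            lines = text.splitlines(keepends=True)
            # The first nheaderlines lines form the header, the rest the body.
            self.header = "".join(lines[:nheaderlines])
            self.body = "".join(lines[nheaderlines:])

    def is_file(self):
        """Does the file exist? Delegates to pathlib.Path.is_file()."""
        return self.path.is_file()

    @property
    def id(self):
        # Assumption: the id is the filename without its extension.
        return self.path.stem


# Usage with a throwaway file containing two header lines.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "doc42.txt"
    sample.write_text("Title: x\nDate: y\nBody line 1\nBody line 2\n", encoding="utf-8")
    corpus_file = HeaderedTextFile(sample, nheaderlines=2)
    missing = HeaderedTextFile(Path(tmp) / "absent.txt")
    missing_exists = missing.is_file()
```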
- class CorrectOCR.workspace.JSONResource(path, **kwargs)
Bases: dict
Simple wrapper for JSON files.
- Parameters
path – Path to load from.
kwargs – TODO
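A dict-based wrapper around a JSON file might look roughly like the sketch below. The class `JSONDict` and its `save` method are hypothetical. Because the source leaves kwargs undocumented, forwarding them to `json.load` is purely an assumption made for illustration.

```python
import json
import tempfile
from pathlib import Path


class JSONDict(dict):
    """Hypothetical dict subclass backed by a JSON file."""

    def __init__(self, path, **kwargs):
        super().__init__()
        self.path = Path(path)
        if self.path.is_file():
            with open(self.path, encoding="utf-8") as f:
                # Assumption: extra keyword arguments go to json.load.
                # The top-level JSON value must be an object for update().
                self.update(json.load(f, **kwargs))

    def save(self):
        """Write the current contents back to the file."""
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self, f, ensure_ascii=False, indent=2)


# Usage: create, save, and reload from a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    resource = JSONDict(Path(tmp) / "settings.json")
    resource["threshold"] = 3
    resource.save()
    reloaded = JSONDict(Path(tmp) / "settings.json")
```

Subclassing dict keeps the usual mapping interface available, so callers can read and write keys without knowing the data came from a file.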