CorrectOCR.workspace module
- class CorrectOCR.workspace.Workspace(workspaceconfig, resourceconfig, storageconfig)
Bases: object
The Workspace holds references to Documents and resources used by the various commands.
- Parameters
workspaceconfig – An object with the following properties:
  nheaderlines (int): The number of header lines in corpus texts.
  language: A language instance from pycountry <https://pypi.org/project/pycountry/>.
  originalPath (Path): Directory containing the original docs.
  goldPath (Path): Directory containing the gold docs (if any).
  trainingPath (Path): Directory for storing intermediate docs.
  docInfoBaseURL (str): Base URL that, when a doc_id is appended to it, provides information about documents.
resourceconfig – Passed directly to ResourceManager; see that class for further info.
storageconfig – TODO
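As a rough illustration of the workspaceconfig properties listed above, the following sketch builds a plain object carrying them. The property names come from the parameter list; the directory layout, the example URL, the doc_id value and the use of SimpleNamespace are assumptions made for illustration, and the sketch does not construct a Workspace itself.

```python
from pathlib import Path
from types import SimpleNamespace

try:
    # The documented `language` property expects a pycountry language instance.
    import pycountry
    language = pycountry.languages.get(alpha_2="en")
except ImportError:
    language = None  # placeholder so the sketch still runs without pycountry

# Hypothetical directory layout; only the property names come from the docs.
root = Path("workspace")

workspaceconfig = SimpleNamespace(
    nheaderlines=0,
    language=language,
    originalPath=root / "original",
    goldPath=root / "gold",
    trainingPath=root / "training",
    docInfoBaseURL="https://example.org/docs/",  # hypothetical URL
)

# Per the description, document info is found by appending a doc_id
# to the base URL.
doc_id = "7696"  # hypothetical doc_id
info_url = workspaceconfig.docInfoBaseURL + doc_id
print(info_url)
```

Whether Workspace accepts exactly this kind of object is not stated here; the documentation only requires "an object with the following properties".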
- add_doc(doc)
Initializes a new Document and adds it to the workspace.
The doc_id of the document will be determined by its filename.
If the file is not in the originalPath, it will be copied or downloaded there.
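The behaviour described for add_doc can be approximated with a small standalone helper. This is only a sketch under stated assumptions: place_in_original is a hypothetical function, not part of CorrectOCR; the doc_id is assumed to be the filename without its extension; and only local copying is shown, not downloading.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Tuple


def place_in_original(doc: Path, original_path: Path) -> Tuple[str, Path]:
    """Hypothetical helper: derive a doc_id and make sure the file
    lives in original_path, copying it there if necessary."""
    doc_id = doc.stem  # assumption: filename without its extension
    target = original_path / doc.name
    if doc.resolve().parent != original_path.resolve():
        original_path.mkdir(parents=True, exist_ok=True)
        shutil.copy(doc, target)  # local files only; downloads not covered
    return doc_id, target


with tempfile.TemporaryDirectory() as tmp:
    # A file that starts out outside the "original" directory.
    incoming = Path(tmp) / "incoming" / "report_001.txt"
    incoming.parent.mkdir()
    incoming.write_text("some OCR text\n", encoding="utf-8")

    doc_id, stored = place_in_original(incoming, Path(tmp) / "original")
    existed = stored.exists()
    print(doc_id, existed)
```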
- class CorrectOCR.workspace.CorpusFile(path, nheaderlines=0)
Bases: object
Simple wrapper for text files to manage a number of lines as a separate header.
- Parameters
- is_file()
- Return type
bool
- Returns
Whether the file exists. See pathlib.Path.is_file().
- property id
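To make the idea of managing a number of lines as a separate header concrete, here is a minimal sketch of a wrapper in the same spirit. HeaderedTextFile and its header and body attributes are hypothetical names; the actual CorpusFile may store and expose its contents differently.

```python
import tempfile
from pathlib import Path


class HeaderedTextFile:
    """Hypothetical stand-in for illustration, not the library's CorpusFile."""

    def __init__(self, path, nheaderlines=0):
        self.path = Path(path)
        self.nheaderlines = nheaderlines
        self.header = ""
        self.body = ""
        if self.is_file():
            # Keep line endings so header + body reproduces the file exactly.
            lines = self.path.read_text(encoding="utf-8").splitlines(keepends=True)
            self.header = "".join(lines[:nheaderlines])
            self.body = "".join(lines[nheaderlines:])

    def is_file(self):
        # Delegates to pathlib.Path.is_file(), as the documented method does.
        return self.path.is_file()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "doc.txt"
    sample.write_text("Title: Example\nDate: unknown\nFirst line\nSecond line\n",
                      encoding="utf-8")

    corpus_file = HeaderedTextFile(sample, nheaderlines=2)
    missing = HeaderedTextFile(Path(tmp) / "nope.txt", nheaderlines=2)
    missing_exists = missing.is_file()
    print(repr(corpus_file.header))
    print(repr(corpus_file.body))
```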
- class CorrectOCR.workspace.JSONResource(path, **kwargs)
Bases: dict
Simple wrapper for JSON files.
- Parameters
path – Path to load from.
kwargs – TODO
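A dict-based wrapper around a JSON file can be sketched as follows. JSONDict and its save method are hypothetical and shown only to illustrate the pattern; the handling of kwargs is undocumented above and is deliberately left out.

```python
import json
import tempfile
from pathlib import Path


class JSONDict(dict):
    """Hypothetical stand-in: a dict populated from a JSON file if it exists."""

    def __init__(self, path):
        super().__init__()
        self.path = Path(path)
        if self.path.is_file():
            with self.path.open(encoding="utf-8") as f:
                self.update(json.load(f))

    def save(self):
        # Hypothetical method; the documented class may persist data differently.
        with self.path.open("w", encoding="utf-8") as f:
            json.dump(self, f, ensure_ascii=False, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corrections.json"

    resource = JSONDict(path)   # file does not exist yet, so the dict is empty
    resource["tbe"] = "the"     # behaves like an ordinary dict
    resource.save()

    reloaded = JSONDict(path)   # a second instance reads the saved content
    print(dict(reloaded))
```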