CorrectOCR.workspace module

class CorrectOCR.workspace.LazyDocumentDict(workspace, *args, **kargs)[source]

Bases: collections.abc.MutableMapping

class CorrectOCR.workspace.Workspace(workspaceconfig, resourceconfig, storageconfig)[source]

Bases: object

The Workspace holds references to Documents and resources used by the various commands.

Parameters
  • workspaceconfig

    An object with the following properties:

    • nheaderlines (int): The number of header lines in corpus texts.

    • language: A language instance from pycountry <https://pypi.org/project/pycountry/>.

    • originalPath (Path): Directory containing the original docs.

    • goldPath (Path): Directory containing the gold docs (if any).

    • trainingPath (Path): Directory for storing intermediate docs.

    • docInfoBaseURL (str): Base URL to which a doc_id is appended to obtain information about a document.

  • resourceconfig – Passed directly to ResourceManager; see ResourceManager for further information.

  • storageconfig – TODO
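The property list above can be illustrated with a plain attribute container. This is only a sketch: `types.SimpleNamespace` is a stand-in for whatever configuration object the application actually builds, the paths and URL are made-up values, and the `Workspace` constructor itself is not called here.

```python
from pathlib import Path
from types import SimpleNamespace

# Hypothetical stand-in for workspaceconfig. Only the property names come
# from the documentation above; every value is an illustrative assumption.
workspaceconfig = SimpleNamespace(
    nheaderlines=0,
    language=None,  # in practice a pycountry language instance
    originalPath=Path("workspace/original"),
    goldPath=Path("workspace/gold"),
    trainingPath=Path("workspace/training"),
    docInfoBaseURL="https://example.org/docs/",
)

# docInfoBaseURL is documented as a base URL to which a doc_id is appended.
doc_id = "example_doc"  # hypothetical doc_id
info_url = workspaceconfig.docInfoBaseURL + doc_id
print(info_url)
```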

add_doc(doc)[source]

Initializes a new Document and adds it to the workspace.

The doc_id of the document will be determined by its filename.

If the file is not in the originalPath, it will be copied or downloaded there.

Parameters

doc (Any) – A path or URL.

Return type

str
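One plausible reading of "determined by its filename" is that the doc_id is the filename without its extension. The helper below, `doc_id_sketch`, is hypothetical and rests on that assumption; the actual rule used by add_doc may differ.

```python
from pathlib import Path
from urllib.parse import urlparse

def doc_id_sketch(doc: str) -> str:
    """Derive an id from a path or URL (assumption: filename minus extension)."""
    parsed = urlparse(doc)
    # For URLs, only the path component carries the filename.
    name = parsed.path if parsed.scheme in ("http", "https") else doc
    return Path(name).stem

url_id = doc_id_sketch("https://example.org/scans/book_7.pdf")
path_id = doc_id_sketch("originals/letter.txt")
```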

documents(ext=None, server_ready=False, is_done=False)[source]

Yields documents filtered by the given criteria.

Parameters
  • ext – Only include docs with this extension.

  • server_ready – Only include documents that are ready (prepared).

  • is_done – Only include documents that are done (all tokens have gold).

Return type

List[str]

cleanup(dryrun=True, full=False)[source]

Cleans out the backup files in the trainingPath.

Parameters
  • dryrun – Just lists the files without actually deleting them.

  • full – Also deletes the current files (i.e. those without .nnn. in their suffix).
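The interaction of dryrun and full can be sketched as follows. The function `cleanup_sketch` is hypothetical, and the pattern assumes that `.nnn.` means a three-digit counter before the final extension (for example `doc.001.csv`); the real naming scheme may differ.

```python
import re
import tempfile
from pathlib import Path

# Assumption: backups look like "name.001.ext" (three digits before the
# final extension). This pattern is illustrative, not taken from CorrectOCR.
BACKUP_PATTERN = re.compile(r"\.\d{3}\.[^.]+$")

def cleanup_sketch(training_path: Path, dryrun: bool = True, full: bool = False) -> list:
    """Return the files that were (or, in a dry run, would be) deleted."""
    selected = []
    for path in sorted(training_path.iterdir()):
        if not path.is_file():
            continue
        is_backup = BACKUP_PATTERN.search(path.name) is not None
        # Backups are always selected; current files only when full=True.
        if is_backup or full:
            selected.append(path)
            if not dryrun:
                path.unlink()
    return selected

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("doc.csv", "doc.001.csv", "doc.002.csv"):
        (root / name).write_text("x")
    listed = [p.name for p in cleanup_sketch(root, dryrun=True)]
    still_there = sorted(p.name for p in root.iterdir())
    deleted = [p.name for p in cleanup_sketch(root, dryrun=False)]
    remaining = sorted(p.name for p in root.iterdir())
```

Defaulting dryrun to True, as the documented signature does, means a caller has to opt in explicitly before anything is removed.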

class CorrectOCR.workspace.CorpusFile(path, nheaderlines=0)[source]

Bases: object

Simple wrapper for text files to manage a number of lines as a separate header.

Parameters
  • path (Path) – Path to text file.

  • nheaderlines (int) – Number of lines from beginning to separate out as header.

save()[source]

Concatenate header and body and save.

is_file()[source]

Return type

bool

Returns

Does the file exist? See pathlib.Path.is_file().
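The header/body separation described for CorpusFile might look roughly like this. `CorpusFileSketch` is an illustrative stand-in and not the CorrectOCR implementation; the attribute names `header` and `body` are assumptions.

```python
import tempfile
from pathlib import Path

class CorpusFileSketch:
    """Illustrative stand-in: keeps the first nheaderlines lines apart from the rest."""

    def __init__(self, path: Path, nheaderlines: int = 0):
        self.path = path
        self.nheaderlines = nheaderlines
        if self.path.is_file():
            lines = self.path.read_text(encoding="utf-8").splitlines(keepends=True)
            self.header = "".join(lines[:nheaderlines])
            self.body = "".join(lines[nheaderlines:])
        else:
            self.header, self.body = "", ""

    def save(self) -> None:
        # Concatenate header and body and write them back to the same path.
        self.path.write_text(self.header + self.body, encoding="utf-8")

    def is_file(self) -> bool:
        return self.path.is_file()

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"
    path.write_text("Title\nAuthor\nFirst line\nSecond line\n", encoding="utf-8")
    corpus = CorpusFileSketch(path, nheaderlines=2)
    header, body = corpus.header, corpus.body
    corpus.body = body.upper()  # edit the body only; the header is untouched
    corpus.save()
    saved = path.read_text(encoding="utf-8")
```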

property id

class CorrectOCR.workspace.JSONResource(path, **kwargs)[source]

Bases: dict

Simple wrapper for JSON files.

Parameters
  • path – Path to load from.

  • kwargs – TODO

save()[source]

Save to JSON file.
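A dict subclass that loads from and saves to a JSON file can be sketched as below. `JSONResourceSketch` is hypothetical; because the purpose of kwargs is not documented, the sketch accepts them and ignores them.

```python
import json
import tempfile
from pathlib import Path

class JSONResourceSketch(dict):
    """Illustrative stand-in: a dict that remembers which JSON file it belongs to."""

    def __init__(self, path: Path, **kwargs):
        # kwargs are accepted but ignored: their meaning is not documented.
        super().__init__()
        self.path = Path(path)
        if self.path.is_file():
            with self.path.open(encoding="utf-8") as f:
                self.update(json.load(f))

    def save(self) -> None:
        with self.path.open("w", encoding="utf-8") as f:
            json.dump(self, f, ensure_ascii=False, indent=2)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "resource.json"
    resource = JSONResourceSketch(path)  # file does not exist yet: starts empty
    resource["teh"] = "the"
    resource.save()
    reloaded = JSONResourceSketch(path)  # reads the saved content back
```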

class CorrectOCR.workspace.ResourceManager(root, config)[source]

Bases: object

Helper for the Workspace to manage various resources.

Parameters
  • root (Path) – Path to resources directory.

  • config

    An object with the following properties:

    • correctionTrackingFile (Path): Path to file containing correction tracking.

    • TODO