CatmaProject#

class CatmaProject(project_name, projects_directory='./', included_acs=None, excluded_acs=None, ac_filter_keyword=None, load_from_gitlab=False, gitlab_access_token=None, backup_directory='./')#

Bases: object

Class that represents a CATMA project including all documents, tagsets and annotation collections.

You can either load the project from a local Git clone or you can load it directly from GitLab after generating an access token in the CATMA GUI. See the examples in the docs for details.

Parameters
  • project_name (str) – The CATMA project name.

  • projects_directory (str, optional) – The directory where your CATMA projects are located. Defaults to ‘./’.

  • included_acs (list, optional) – All annotation collections that should get loaded. If annotation collections are neither included nor excluded, all annotation collections get loaded. Defaults to None.

  • excluded_acs (list, optional) – Annotation collections that should not get loaded. Defaults to None.

  • ac_filter_keyword (str, optional) – Only annotation collections with the given keyword get loaded. Defaults to None.

  • load_from_gitlab (bool, optional) – Whether the CATMA project should be loaded directly from CATMA’s GitLab backend. Defaults to False.

  • gitlab_access_token (str, optional) – The private CATMA GitLab access token. Defaults to None.

  • backup_directory (str, optional) – The directory where your project clone should be located. Defaults to ‘./’.

Raises

FileNotFoundError – If the local or remote CATMA project was not found.

uuid: str#

The project’s UUID.

projects_directory: str#

The directory where the project is located.

name: str#

The project’s name.

tagsets: List[gitma.tagset.Tagset]#

List of gitma.Tagset objects.

tagset_dict: Dict[str, gitma.tagset.Tagset]#

Dictionary of the project’s tagsets with the UUIDs as keys and gitma.Tagset objects as values.

texts: List[gitma.text.Text]#

List of the gitma.Text objects.

text_dict: Dict[str, gitma.text.Text]#

Dictionary of the project’s texts with titles as keys and gitma.Text objects as values.

annotation_collections: List[gitma.annotation_collection.AnnotationCollection]#

List of gitma.AnnotationCollection objects.

ac_dict: Dict[str, gitma.annotation_collection.AnnotationCollection]#

Dictionary of the project’s annotation collections with names as keys and gitma.AnnotationCollection objects as values.

to_json(annotation_collections='all', rename_dict=None, included_tags=None, directory='./')#

Saves all annotations as a single JSON file.

Parameters
  • annotation_collections (Union[List[str], str], optional) – Parameter to define the exported annotation collections. Defaults to ‘all’.

  • rename_dict (Union[Dict[str, str], None], optional) – Dictionary to rename annotation collections. Defaults to None.

  • included_tags (Union[list, None]) – Tags included in the annotations list. If None all tags are included. Defaults to None.

  • directory (str) – Backup directory. Defaults to ‘./’.

update()#

Updates local git folder and reloads CatmaProject.

Warning: This method can only be used if you have Git installed.

annotations()#

Generator that yields all annotations as gitma.annotation.Annotation objects.

Yields

Annotation – gitma.annotation.Annotation

all_tags()#

Generator that yields all tags as gitma.tag.Tag objects.

Yields

Tag – gitma.tag.Tag

stats()#

Shows some CATMA project stats.

Returns

DataFrame with project’s stats sorted by the annotation collection names.

Return type

pd.DataFrame

write_annotation_json(text_title, annotation_collection_name, tagset_name, tag_name, start_points, end_points, property_annotations, author)#

Function to write a new annotation into this project.

Parameters
  • text_title (str) – The text title.

  • annotation_collection_name (str) – The name of the target annotation collection.

  • tagset_name (str) – The tagset’s name.

  • tag_name (str) – The tag’s name.

  • start_points (list) – The start points of the annotation spans.

  • end_points (list) – The end points of the annotation spans.

  • property_annotations (dict) – A dictionary with property names mapped to value lists.

  • author (str) – The annotation’s author.
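The start_points and end_points lists are read in parallel, one pair per annotation span. The following sketch only illustrates that pairing; `build_spans` is a hypothetical helper and not part of GitMA, and the validation rules are assumptions.

```python
def build_spans(start_points, end_points):
    """Pair start and end points into (start, end) spans (hypothetical helper)."""
    if len(start_points) != len(end_points):
        raise ValueError('start_points and end_points must have the same length')
    spans = list(zip(start_points, end_points))
    for start, end in spans:
        # Assumption: a span must cover at least one character.
        if start >= end:
            raise ValueError(f'invalid span: {start}-{end}')
    return spans


# A discontinuous annotation made of two spans.
spans = build_spans(start_points=[0, 20], end_points=[10, 25])
```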

create_gold_annotations(ac_1_name, ac_2_name, gold_ac_name, excluded_tags=None, min_overlap=1.0, same_tag=True, copy_property_values_if_equal=True, push_to_gitlab=False)#

Searches for matching annotations in two annotation collections of this project and copies all matches into a third annotation collection. By default, property values are copied when they are exactly the same for matching annotations.

Parameters
  • ac_1_name (str) – The name of the first annotation collection.

  • ac_2_name (str) – The name of the second annotation collection.

  • gold_ac_name (str) – The name of the third annotation collection, into which gold annotations will be written.

  • excluded_tags (list, optional) – Annotations with these tags will not be included in the gold annotations. Defaults to None.

  • min_overlap (float, optional) – The minimal overlap to generate a gold annotation. Defaults to 1.0.

  • same_tag (bool, optional) – Whether both annotations have to use the same tag. Defaults to True.

  • copy_property_values_if_equal (bool, optional) – Whether property values should be copied when they are exactly the same for matching annotations. Defaults to True. If False or property values are not exactly the same, no property values are copied.

  • push_to_gitlab (bool, optional) – Whether the gold annotations should be uploaded to the CATMA GitLab backend. Defaults to False.
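The interplay of min_overlap and same_tag can be pictured with a small sketch. It is not GitMA's implementation: the helper names `overlap_ratio` and `is_gold_match`, the tuple layout, and the definition of overlap as shared characters divided by the longer span are all assumptions made for illustration.

```python
def overlap_ratio(span_a, span_b):
    """Shared characters divided by the length of the longer span (assumed definition)."""
    start = max(span_a[0], span_b[0])
    end = min(span_a[1], span_b[1])
    shared = max(0, end - start)
    longest = max(span_a[1] - span_a[0], span_b[1] - span_b[0])
    return shared / longest if longest else 0.0


def is_gold_match(an_1, an_2, min_overlap=1.0, same_tag=True):
    """an_1 and an_2 are hypothetical (start, end, tag) tuples."""
    if same_tag and an_1[2] != an_2[2]:
        return False
    return overlap_ratio(an_1[:2], an_2[:2]) >= min_overlap
```

Under these assumptions the default min_overlap=1.0 accepts only annotations with identical spans, while a lower value such as 0.8 also accepts annotations whose boundaries differ slightly.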

merge_annotations()#

Concatenates all annotation collections to one pandas data frame and resets index.

Returns

Data frame including all annotations in the CATMA project.

Return type

pd.DataFrame

merge_annotations_per_document()#

Merges all annotations per document to one annotation collection.

Returns

Dictionary with document titles as keys and annotations per document as pandas data frame.

Return type

Dict[str, pd.DataFrame]

plot_annotation_progression()#

Plot the annotation progression for every annotator in a CATMA project.

Returns

Plotly scatter plot.

Return type

go.Figure

plot_interactive(color_col='annotation collection')#

This function generates one Plotly scatter plot per annotated document in a CATMA project. By default the colors represent the annotation collections, so individual collections can be deactivated with the interactive legend.

Parameters

color_col (str, optional) – ‘annotation collection’, ‘annotator’, ‘tag’ or any property with the prefix ‘prop:’. Defaults to ‘annotation collection’.

Returns

Plotly scatter plot.

Return type

go.Figure

plot_annotations(color_col='annotation collection')#

This function generates one Plotly scatter plot per annotated document in a CATMA project. By default the colors represent the annotation collections, so individual collections can be deactivated with the interactive legend.

Parameters

color_col (str, optional) – ‘annotation collection’, ‘annotator’, ‘tag’ or any property with the prefix ‘prop:’. Defaults to ‘annotation collection’.

Returns

Plotly scatter plot.

Return type

go.Figure

cooccurrence_network(annotation_collections='all', character_distance=100, included_tags=None, excluded_tags=None, level='tag', plot_stats=False, save_as_gexf=False)#

Draws co-occurrence network graph for annotations.

Every tag is represented by a node and every edge represents two co-occurrent tags. The character_distance parameter defines when two annotations are considered co-occurrent. If you set character_distance=0, only the tags of overlapping annotations will be represented as connected nodes.

See the examples in the docs for details about the usage.

Parameters
  • annotation_collections (Union[str, List[str]]) – List with the names of the included annotation collections. If set to ‘all’ all annotation collections are included. Defaults to ‘all’.

  • character_distance (int, optional) – The distance in characters within which annotations are considered co-occurrent. Defaults to 100.

  • included_tags (list, optional) – List of included tags. Defaults to None.

  • excluded_tags (list, optional) – List of excluded tags. Defaults to None.

  • level (str, optional) – ‘tag’ or any property name with ‘prop:’ as prefix. Defaults to ‘tag’.

  • plot_stats (bool, optional) – Whether to return network stats. Defaults to False.

  • save_as_gexf (Union[bool, str], optional) – If given a string, the network is saved as a Gephi file with that string as filename. Defaults to False.
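How character_distance turns annotations into edges can be sketched in plain Python. This is an illustration under stated assumptions, not GitMA's code: `gap` and `cooccurrence_edges` are hypothetical helpers, annotations are simplified to (start, end, tag) tuples from a single document, touching spans are treated like overlapping ones, and pairs with the same tag are skipped.

```python
from itertools import combinations


def gap(span_a, span_b):
    """Characters between two (start, end) spans; 0 if they overlap or touch."""
    return max(0, max(span_a[0], span_b[0]) - min(span_a[1], span_b[1]))


def cooccurrence_edges(annotations, character_distance=100):
    """Return the set of tag pairs whose annotations lie within character_distance."""
    edges = set()
    for an_1, an_2 in combinations(annotations, 2):
        # Assumption: an edge always connects two different tags.
        if an_1[2] != an_2[2] and gap(an_1[:2], an_2[:2]) <= character_distance:
            edges.add(tuple(sorted((an_1[2], an_2[2]))))
    return edges


annotations = [(0, 10, 'A'), (5, 15, 'B'), (60, 70, 'C')]
overlapping_only = cooccurrence_edges(annotations, character_distance=0)
within_50 = cooccurrence_edges(annotations, character_distance=50)
```

With character_distance=0 only the overlapping pair remains connected; raising the value adds edges for annotations that are merely close to each other.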

disagreement_network(annotation_collections='all', included_tags=None, excluded_tags=None, level='tag', plot_stats=False, save_as_gexf=False)#

Draws disagreement network.

Every edge in the network represents two overlapping annotations from different annotation collections and with different tags or property values.

Parameters
  • annotation_collections (Union[str, List[str]], optional) – List with the names of the included annotation collections. If set to ‘all’ all annotation collections are included. Defaults to ‘all’.

  • included_tags (list, optional) – List of included tags. Defaults to None.

  • excluded_tags (list, optional) – List of excluded tags. Defaults to None.

  • level (str, optional) – ‘tag’ or any property name with ‘prop:’ as prefix. Defaults to ‘tag’.

  • plot_stats (bool, optional) – Whether to return network stats. Defaults to False.

  • save_as_gexf (Union[bool, str], optional) – If given a string, the network is saved as a Gephi file with that string as filename. Defaults to False.
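The edge rule described above (overlapping annotations, different annotation collections, different tags) can be written down as a short sketch. The function `disagreement_edges` and the (collection, start, end, tag) tuple layout are hypothetical and only illustrate the rule; they are not how GitMA stores annotations or builds the network.

```python
from itertools import combinations


def disagreement_edges(annotations):
    """Return one (tag, tag) edge per disagreeing pair of annotations."""
    edges = []
    for an_1, an_2 in combinations(annotations, 2):
        different_collection = an_1[0] != an_2[0]
        overlapping = an_1[1] < an_2[2] and an_2[1] < an_1[2]
        different_tag = an_1[3] != an_2[3]
        if different_collection and overlapping and different_tag:
            edges.append(tuple(sorted((an_1[3], an_2[3]))))
    return edges


annotations = [
    ('ac1', 0, 10, 'A'),
    ('ac2', 0, 10, 'B'),   # overlaps the first annotation with another tag
    ('ac2', 20, 30, 'A'),
    ('ac1', 20, 30, 'A'),  # same span and same tag: agreement, no edge
    ('ac1', 5, 8, 'C'),    # overlaps the 'B' annotation of ac2
]
edges = disagreement_edges(annotations)
```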

compare_annotation_collections(annotation_collections, color_col='tag')#

Plots annotations of multiple annotation collections of the same texts as line plot.

Parameters
  • annotation_collections (list) – A list of annotation collection names.

  • color_col (str, optional) – Either ‘tag’ or one property name with prefix ‘prop:’. Defaults to ‘tag’.

Raises

ValueError – If one of the annotation collection’s names does not exist.

Returns

Plotly Line Plot.

Return type

go.Figure

get_iaa(ac1_name_or_inst, ac2_name_or_inst, tag_filter=None, filter_both_ac=False, level='tag', include_empty_annotations=True, distance='binary', verbose=True, return_as_dict=False)#

This method is deprecated! See calculate_scotts_pi and calculate_cohens_kappa.

Computes Inter-Annotator-Agreement for two annotation collections. See the demo notebook for details.

Parameters
  • ac1_name_or_inst (str) – The name or instance of the first annotation collection, whose annotations form the basis of the computation.

  • ac2_name_or_inst (str) – The name or instance of the second annotation collection, whose annotations will be searched for matches to those in the first.

  • tag_filter (list, optional) – Which tags should be included. Defaults to None (all tags).

  • filter_both_ac (bool, optional) – Whether the tag filter should be applied to both annotation collections. Defaults to False (only applied to the first collection).

  • level (str, optional) – Whether the annotations’ tags or a specified property (prefixed with ‘prop:’) should be compared. Defaults to ‘tag’.

  • include_empty_annotations (bool, optional) – If False, only annotations with a matching annotation in the second collection are included. Defaults to True.

  • distance (str, optional) – The IAA distance function. Either ‘binary’ or ‘interval’. See the NLTK API for further information. Defaults to ‘binary’.

  • verbose (bool, optional) – Whether to print results to stdout. Defaults to True.

  • return_as_dict (bool, optional) – Whether the computed agreement scores should be returned as a dictionary in addition to being printed (assuming verbose=True). Defaults to False, in which case a Pandas DataFrame with a confusion matrix is returned instead.

calculate_scotts_pi(ac1_name_or_inst, ac2_name_or_inst, tag_filter=None, filter_both_ac=False, level='tag', include_empty_annotations=True, distance='binary', verbose=True)#

Computes the Scott’s Pi inter-annotator-agreement (IAA) metric for two annotation collections. See the demo notebook for details.

Parameters
  • ac1_name_or_inst (str) – The name or instance of the first annotation collection, whose annotations form the basis of the computation.

  • ac2_name_or_inst (str) – The name or instance of the second annotation collection, whose annotations will be searched for matches to those in the first.

  • tag_filter (list, optional) – Which tags should be included. Defaults to None (all tags).

  • filter_both_ac (bool, optional) – Whether the tag filter should be applied to both annotation collections. Defaults to False (only applied to the first collection).

  • level (str, optional) – Whether the annotations’ tags or a specified property (prefixed with ‘prop:’) should be compared. Defaults to ‘tag’.

  • include_empty_annotations (bool, optional) – If False, only annotations with a matching annotation in the second collection are included. Defaults to True.

  • distance (str, optional) – The IAA distance function. Either ‘binary’ or ‘interval’. See the NLTK API for further information. Defaults to ‘binary’.

  • verbose (bool, optional) – Whether to print results to stdout. Defaults to True.

Returns

The metric score and a Pandas DataFrame with a confusion matrix.

Return type

Tuple[Union[Float, Any], pd.DataFrame]
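For orientation, Scott's Pi compares the observed agreement with the agreement expected by chance, where chance is estimated from the label distribution pooled over both annotators. The sketch below computes it for two already aligned label lists. The function name `scotts_pi` and the list-based input are assumptions for illustration; the method itself works on annotation collections and follows the NLTK API, and this sketch corresponds only to a binary (equal or not equal) distance.

```python
from collections import Counter


def scotts_pi(labels_1, labels_2):
    """Scott's Pi for two equally long label lists (one label per compared unit)."""
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Chance agreement from the label proportions pooled over both annotators.
    pooled = Counter(labels_1) + Counter(labels_2)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    # Assumes at least two different labels occur, so that expected < 1.
    return (observed - expected) / (1 - expected)


pi = scotts_pi(['A', 'A', 'B', 'B'], ['A', 'A', 'B', 'A'])
```

In this example the annotators agree on 3 of 4 units (observed 0.75), the pooled proportions are 5/8 and 3/8 (expected 0.53125), and the score is 7/15, roughly 0.47.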

calculate_cohens_kappa(ac1_name_or_inst, ac2_name_or_inst, tag_filter=None, filter_both_ac=False, level='tag', include_empty_annotations=True, distance='binary', verbose=True)#

Computes the Cohen’s Kappa inter-annotator-agreement (IAA) metric for two annotation collections. See the demo notebook for details.

Parameters
  • ac1_name_or_inst (str) – The name or instance of the first annotation collection, whose annotations form the basis of the computation.

  • ac2_name_or_inst (str) – The name or instance of the second annotation collection, whose annotations will be searched for matches to those in the first.

  • tag_filter (list, optional) – Which tags should be included. Defaults to None (all tags).

  • filter_both_ac (bool, optional) – Whether the tag filter should be applied to both annotation collections. Defaults to False (only applied to the first collection).

  • level (str, optional) – Whether the annotations’ tags or a specified property (prefixed with ‘prop:’) should be compared. Defaults to ‘tag’.

  • include_empty_annotations (bool, optional) – If False, only annotations with a matching annotation in the second collection are included. Defaults to True.

  • distance (str, optional) – The IAA distance function. Either ‘binary’ or ‘interval’. See the NLTK API for further information. Defaults to ‘binary’.

  • verbose (bool, optional) – Whether to print results to stdout. Defaults to True.

Returns

The metric score and a Pandas DataFrame with a confusion matrix.

Return type

Tuple[Union[Float, Any], pd.DataFrame]
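Cohen's Kappa has the same form as Scott's Pi but estimates chance agreement from each annotator's own label distribution instead of a pooled one. The sketch below shows this for two already aligned label lists. The function name `cohens_kappa` and the list-based input are assumptions for illustration, not the method's actual interface, and the sketch covers only a binary (equal or not equal) distance.

```python
from collections import Counter


def cohens_kappa(labels_1, labels_2):
    """Cohen's Kappa for two equally long label lists (one label per compared unit)."""
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    # Chance agreement from the product of each annotator's own label proportions.
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / n ** 2
    # Assumes expected < 1, i.e. the annotators do not both use one single label.
    return (observed - expected) / (1 - expected)


kappa = cohens_kappa(['A', 'A', 'B', 'B'], ['A', 'A', 'B', 'A'])
```

Here the observed agreement is 0.75 and the expected agreement is (2·3 + 2·1) / 16 = 0.5, which gives a Kappa of 0.5.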

gamma_agreement(annotation_collections, alpha=3, beta=1, delta_empty=0.01, n_samples=30, precision_level=0.01)#

pygamma_table(annotation_collections='all')#

Concatenates annotation collections to pygamma table.

Parameters

annotation_collections (Union[str, list], optional) – List of annotation collections. Defaults to ‘all’.

Returns

Concatenated annotation collections as pd.DataFrame in pygamma format.

Return type

pd.DataFrame

Examples#

Load a local CATMA project#

To load a CATMA project you have already cloned, only the project’s name and its location are required:

import gitma

project = gitma.CatmaProject(
   project_name='DemoProject',
   projects_directory='../user_projects/'
)

By adding the parameters included_acs, excluded_acs or ac_filter_keyword, you can select which annotation collections get loaded. Assuming you have a project with three annotation collections named ‘ac1’, ‘ac2’ and ‘AC3’, you can select the first two annotation collections in any of these ways:

# option 1
project = gitma.CatmaProject(
   project_name='DemoProject',
   projects_directory='../user_projects/',
   included_acs=['ac1', 'ac2']
)

# option 2
project = gitma.CatmaProject(
   project_name='DemoProject',
   projects_directory='../user_projects/',
   excluded_acs=['AC3']
)

# option 3
project = gitma.CatmaProject(
   project_name='DemoProject',
   projects_directory='../user_projects/',
   ac_filter_keyword='ac'
)
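Why the three options select the same collections can be checked with a small stand-alone sketch. The function `select_collections` is hypothetical and only mimics the documented behavior; in particular, matching the keyword as a case-sensitive substring of the collection name is an assumption that is consistent with the example above but not confirmed by it.

```python
def select_collections(names, included_acs=None, excluded_acs=None, ac_filter_keyword=None):
    """Return the collection names that would pass the three filters (assumed logic)."""
    selected = []
    for name in names:
        if included_acs is not None and name not in included_acs:
            continue
        if excluded_acs is not None and name in excluded_acs:
            continue
        # Assumption: case-sensitive substring match on the collection name.
        if ac_filter_keyword is not None and ac_filter_keyword not in name:
            continue
        selected.append(name)
    return selected


names = ['ac1', 'ac2', 'AC3']
option_1 = select_collections(names, included_acs=['ac1', 'ac2'])
option_2 = select_collections(names, excluded_acs=['AC3'])
option_3 = select_collections(names, ac_filter_keyword='ac')
```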

Load a remote CATMA project#

If you load your project from the CATMA GitLab backend, three further parameters are required:

project = gitma.CatmaProject(
   project_name='DemoProject',
   load_from_gitlab=True,
   gitlab_access_token='<your_access_token>',
   backup_directory='../user_projects/'
)

When you load a remote CATMA project, it is cloned into the backup_directory. Once a project has been cloned into this directory, you have to load it as a local project, as demonstrated above.

Plot a cooccurrence network for the annotations in your project#

You can plot co-occurrent annotations:

project.cooccurrence_network()

You can customize your network with the following parameters:

project.cooccurrence_network(
   annotation_collections=[      # define the included annotation collections
      '<your_first_annotation_collection>',
      '<your_second_annotation_collection>'
   ],
   level='prop:<your_property>', # set a property as level
   character_distance=50,        # define which distance is considered cooccurrent
   included_tags=None,           # define a list with tags included
   excluded_tags=None,           # define a list with tags excluded
   save_as_gexf='my_gephi_file'  # save your network as Gephi file
)