CatmaProject#

class CatmaProject(project_name, projects_directory='./', included_acs=None, excluded_acs=None, ac_filter_keyword=None, load_from_gitlab=False, gitlab_access_token=None, backup_directory='./')#

Bases: object

Class that represents a CATMA project including all documents, tagsets and annotation collections.

You can either load the project from a local Git clone or you can load it directly from GitLab after generating an access token in the CATMA GUI. See the examples in the docs for details.

Parameters
  • project_name (str) – The CATMA project name.

  • projects_directory (str, optional) – The directory where your CATMA projects are located. Defaults to ‘./’.

  • included_acs (list, optional) – All annotation collections that should get loaded. If annotation collections are neither included nor excluded all annotation collections get loaded. Defaults to None.

  • excluded_acs (list, optional) – Annotation collections that should not get loaded. Defaults to None.

  • ac_filter_keyword (str, optional) – Only annotation collections with the given keyword get loaded. Defaults to None.

  • load_from_gitlab (bool, optional) – Whether the CATMA project should be loaded directly from CATMA’s GitLab backend. Defaults to False.

  • gitlab_access_token (str, optional) – The private CATMA GitLab access token. Defaults to None.

  • backup_directory (str, optional) – The directory where your project clone should be located. Defaults to ‘./’.

Raises

FileNotFoundError – If the local or remote CATMA project was not found.

uuid: str#

The project’s UUID.

projects_directory: str#

The directory where the project is located.

name: str#

The project’s name.

tagsets: List[gitma.tagset.Tagset]#

List of gitma.Tagset objects.

tagset_dict: Dict[str, gitma.tagset.Tagset]#

Dictionary of the project’s tagsets with the UUIDs as keys and gitma.Tagset objects as values.

texts: List[gitma.text.Text]#

List of the gitma.Text objects.

text_dict: Dict[str, gitma.text.Text]#

Dictionary of the project’s texts with titles as keys and gitma.Text objects as values.

annotation_collections: List[gitma.annotation_collection.AnnotationCollection]#

List of gitma.AnnotationCollection objects.

ac_dict: Dict[str, gitma.annotation_collection.AnnotationCollection]#

Dictionary of the project’s annotation collections with names as keys and gitma.AnnotationCollection objects as values.

to_json(annotation_collections='all', rename_dict=None, included_tags=None, directory='./')#

Saves all annotations as a single JSON file.

Parameters
  • annotation_collections (Union[List[str], str], optional) – Parameter to define the exported annotation collections. Defaults to ‘all’.

  • rename_dict (Union[Dict[str, str], None], optional) – Dictionary to rename annotation collections. Defaults to None.

  • included_tags (Union[list, None]) – Tags included in the annotations list. If None all tags are included. Defaults to None.

  • directory (str) – Backup directory. Defaults to ‘./’.

update()#

Updates local git folder and reloads CatmaProject.

Warning: This method can only be used if you have Git installed.

annotations()#

Generator that yields all annotations as gitma.annotation.Annotation objects.

Yields

Annotation – gitma.annotation.Annotation

all_tags()#

Generator that yields all tags as gitma.tag.Tag objects.

Yields

Tag – gitma.tag.Tag

stats()#

Shows some CATMA Project stats.

Returns

DataFrame with project stats sorted by the Annotation Collection names.

Return type

pd.DataFrame

write_annotation_json(text_title, annotation_collection_name, tagset_name, tag_name, start_points, end_points, property_annotations, author)#

Function to write a new annotation into this project.

Parameters
  • text_title (str) – The text title.

  • annotation_collection_name (str) – The name of the target annotation collection.

  • tagset_name (str) – The tagset’s name.

  • tag_name (str) – The tag’s name.

  • start_points (list) – The start points of the annotation spans.

  • end_points (list) – The end points of the annotation spans.

  • property_annotations (dict) – A dictionary with property names mapped to value lists.

  • author (str) – The annotation’s author.
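
The span parameters can be illustrated with a small sketch. It assumes that start_points and end_points are parallel lists, so that the n-th start point and the n-th end point describe one span; the helper pair_spans and all values are hypothetical and not part of gitma.

```python
from typing import List, Tuple


def pair_spans(start_points: List[int], end_points: List[int]) -> List[Tuple[int, int]]:
    """Pair start and end points and reject obviously broken spans."""
    if len(start_points) != len(end_points):
        raise ValueError('start_points and end_points must have the same length')
    spans = list(zip(start_points, end_points))
    if any(start >= end for start, end in spans):
        raise ValueError('every span must start before it ends')
    return spans


# Hypothetical arguments for a discontinuous annotation with two spans.
annotation_args = {
    'text_title': 'Demo Text',
    'annotation_collection_name': 'ac1',
    'tagset_name': 'demo_tagset',
    'tag_name': 'demo_tag',
    'start_points': [0, 25],
    'end_points': [10, 40],
    'property_annotations': {'certainty': ['high']},  # property name -> value list
    'author': 'demo_author',
}

print(pair_spans(annotation_args['start_points'], annotation_args['end_points']))

# With a loaded project the arguments would be passed on like this:
# project.write_annotation_json(**annotation_args)
```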

create_gold_annotations(ac_1_name, ac_2_name, gold_ac_name, excluded_tags=None, min_overlap=1.0, same_tag=True, property_values='none', push_to_gitlab=False)#

Searches for matching annotations in two AnnotationCollections and copies all matches into a third AnnotationCollection. Whether Property Values get copied is controlled by the property_values parameter.

Parameters
  • ac_1_name (str) – AnnotationCollection 1 Name.

  • ac_2_name (str) – AnnotationCollection 2 Name.

  • gold_ac_name (str) – AnnotationCollection Name for Gold Annotations.

  • excluded_tags (list, optional) – Annotations with these Tags will not be included in the Gold Annotations. Defaults to None.

  • min_overlap (float, optional) – The minimal overlap to generate a gold annotation. Defaults to 1.0.

  • same_tag (bool, optional) – Whether both annotations need to be the same tag. Defaults to True.

  • property_values (str, optional) – If ‘matching’, only matching Property Values from AnnotationCollection 1 are copied. Further option: ‘none’. Defaults to ‘none’.

  • push_to_gitlab (bool, optional) – Whether the gold annotations shall be uploaded to the CATMA GitLab. Defaults to False.
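
How min_overlap and same_tag interact can be sketched in plain Python. This is not gitma's implementation: the overlap measure used here (shared characters divided by the combined extent of both spans) is an assumption, and overlap_ratio and gold_candidates are hypothetical helper names.

```python
from typing import List, Tuple

Annotation = Tuple[str, int, int]  # (tag, start_point, end_point)


def overlap_ratio(a: Annotation, b: Annotation) -> float:
    """Shared characters divided by the combined extent of both spans."""
    intersection = max(0, min(a[2], b[2]) - max(a[1], b[1]))
    extent = max(a[2], b[2]) - min(a[1], b[1])
    return intersection / extent if extent else 0.0


def gold_candidates(
    ac_1: List[Annotation],
    ac_2: List[Annotation],
    min_overlap: float = 1.0,
    same_tag: bool = True,
) -> List[Annotation]:
    """Return the annotations from ac_1 that have a match in ac_2."""
    matches = []
    for a in ac_1:
        for b in ac_2:
            if same_tag and a[0] != b[0]:
                continue  # tags differ, so this pair cannot match
            if overlap_ratio(a, b) >= min_overlap:
                matches.append(a)
                break  # one match in ac_2 is enough
    return matches


ac_1 = [('hero', 0, 10), ('place', 20, 30), ('time', 40, 50)]
ac_2 = [('hero', 0, 10), ('place', 22, 30), ('event', 40, 50)]

print(gold_candidates(ac_1, ac_2))                   # identical span and tag only
print(gold_candidates(ac_1, ac_2, min_overlap=0.8))  # also accepts partial overlap
print(gold_candidates(ac_1, ac_2, same_tag=False))   # ignores the tag
```

With min_overlap=1.0 only annotations with identical spans match under this measure; lowering the value admits partially overlapping annotations.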

merge_annotations()#

Concatenates all annotation collections to one pandas data frame and resets index.

Returns

Data frame including all annotations in the CATMA project.

Return type

pd.DataFrame

merge_annotations_per_document()#

Merges all annotations per document to one annotation collection.

Returns

Dictionary with document titles as keys and annotations per document as pandas data frame.

Return type

Dict[str, pd.DataFrame]

plot_annotation_progression()#

Plot the annotation progression for every annotator in a CATMA project.

Returns

Plotly scatter plot.

Return type

go.Figure

plot_interactive(color_col='annotation collection')#

This function generates one Plotly scatter plot per annotated document in a CATMA project. By default the colors represent the annotation collections, so they can be toggled off in the interactive legend.

Parameters

color_col (str, optional) – ‘annotation collection’, ‘annotator’, ‘tag’ or any property with the prefix ‘prop:’. Defaults to ‘annotation collection’.

Returns

Plotly scatter plot.

Return type

go.Figure

plot_annotations(color_col='annotation collection')#

This function generates one Plotly scatter plot per annotated document in a CATMA project. By default the colors represent the annotation collections, so they can be toggled off in the interactive legend.

Parameters

color_col (str, optional) – ‘annotation collection’, ‘annotator’, ‘tag’ or any property with the prefix ‘prop:’. Defaults to ‘annotation collection’.

Returns

Plotly scatter plot.

Return type

go.Figure

cooccurrence_network(annotation_collections='all', character_distance=100, included_tags=None, excluded_tags=None, level='tag', plot_stats=False, save_as_gexf=False)#

Draws a cooccurrence network graph for annotations.

Every tag is represented by a node and every edge represents two cooccurrent tags. With the character_distance parameter you can define when two annotations are considered cooccurrent. If you set character_distance=0, only the tags of overlapping annotations will be represented as connected nodes.

See the examples in the docs for details about the usage.

Parameters
  • annotation_collections (Union[str, List[str]]) – List with the names of the included annotation collections. If set to ‘all’ all annotation collections are included. Defaults to ‘all’.

  • character_distance (int, optional) – The maximum character distance at which annotations are considered cooccurrent. Defaults to 100.

  • included_tags (list, optional) – List of included tags. Defaults to None.

  • excluded_tags (list, optional) – List of excluded tags. Defaults to None.

  • level (str, optional) – ‘tag’ or any property name with ‘prop:’ as prefix. Defaults to ‘tag’.

  • plot_stats (bool, optional) – Whether to return network stats. Defaults to False.

  • save_as_gexf (Union[bool, str], optional) – If given any string, the network gets saved as a Gephi file with the string as filename. Defaults to False.
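
The effect of character_distance can be sketched without gitma. The following is a simplified illustration, not the library's algorithm: it assumes the distance is measured as the gap between the end of one span and the start of the next, it skips pairs with the same tag, and cooccur and cooccurrence_edges are hypothetical names.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple

Annotation = Tuple[str, int, int]  # (tag, start_point, end_point)


def cooccur(a: Annotation, b: Annotation, character_distance: int) -> bool:
    """Decide whether two annotations count as cooccurrent."""
    gap = max(a[1], b[1]) - min(a[2], b[2])  # negative when the spans overlap
    if gap < 0:
        return True
    return character_distance > 0 and gap <= character_distance


def cooccurrence_edges(annotations: List[Annotation], character_distance: int = 100) -> Counter:
    """Count one edge per pair of cooccurrent annotations with different tags."""
    edges = Counter()
    for a, b in combinations(annotations, 2):
        if a[0] != b[0] and cooccur(a, b, character_distance):
            edges[tuple(sorted((a[0], b[0])))] += 1
    return edges


annotations = [('character', 0, 10), ('place', 5, 20), ('time', 60, 70)]

print(cooccurrence_edges(annotations, character_distance=100))
print(cooccurrence_edges(annotations, character_distance=0))
```

In this sketch the edge count plays the role of an edge weight; with character_distance=0 only the overlapping 'character' and 'place' annotations stay connected.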

disagreement_network(annotation_collections='all', included_tags=None, excluded_tags=None, level='tag', plot_stats=False, save_as_gexf=False)#

Draws a disagreement network.

Every edge in the network represents two overlapping annotations from different annotation collections and with different tags or property values.

Parameters
  • annotation_collections (Union[str, List[str]], optional) – List with the names of the included annotation collections. If set to ‘all’ all annotation collections are included. Defaults to ‘all’.

  • included_tags (list, optional) – List of included tags. Defaults to None.

  • excluded_tags (list, optional) – List of excluded tags. Defaults to None.

  • level (str, optional) – ‘tag’ or any property name with ‘prop:’ as prefix. Defaults to ‘tag’.

  • plot_stats (bool, optional) – Whether to return network stats. Defaults to False.

  • save_as_gexf (Union[bool, str], optional) – If given any string, the network gets saved as a Gephi file with the string as filename. Defaults to False.
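
The edge definition above can be made concrete with a small sketch at the ‘tag’ level. It is an illustration under simplifying assumptions, not gitma's code: annotations are plain tuples, and overlaps and disagreement_edges are hypothetical names.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple

# (annotation collection, tag, start_point, end_point)
Annotation = Tuple[str, str, int, int]


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if both spans share at least one character."""
    return a[2] < b[3] and b[2] < a[3]


def disagreement_edges(annotations: List[Annotation]) -> Counter:
    """Count overlapping annotations from different collections with different tags."""
    edges = Counter()
    for a, b in combinations(annotations, 2):
        if a[0] != b[0] and a[1] != b[1] and overlaps(a, b):
            edges[tuple(sorted((a[1], b[1])))] += 1
    return edges


annotations = [
    ('ac1', 'hero', 0, 10),
    ('ac2', 'villain', 2, 8),   # disagrees with the first annotation
    ('ac1', 'hero', 20, 30),
    ('ac2', 'hero', 20, 30),    # same tag: agreement, no edge
    ('ac1', 'place', 40, 50),
    ('ac1', 'time', 45, 55),    # same collection: no edge
]

print(disagreement_edges(annotations))
```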

compare_annotation_collections(annotation_collections, color_col='tag')#

Plots annotations of multiple annotation collections of the same texts as a line plot.

Parameters
  • annotation_collections (list) – A list of annotation collection names.

  • color_col (str, optional) – Either ‘tag’ or one property name with prefix ‘prop:’. Defaults to ‘tag’.

Raises

ValueError – If one of the annotation collection’s names does not exist.

Returns

Plotly Line Plot.

Return type

go.Figure

get_iaa(ac1_name, ac2_name, tag_filter=None, filter_both_ac=False, level='tag', include_empty_annotations=True, distance='binary', return_as_dict=False)#

Computes Inter Annotator Agreement for 2 Annotation Collections. See the demo notebook for details.

Parameters
  • ac1_name (str) – AnnotationCollection name to be compared.

  • ac2_name (str) – AnnotationCollection name to be compared with.

  • tag_filter (list, optional) – Which Tags should be included. If None all are included. Defaults to None.

  • filter_both_ac (bool, optional) – Whether the tag filter shall be applied to both annotation collections. Defaults to False.

  • level (str, optional) – Whether the Annotation Tag or a specified Property should be compared. Defaults to ‘tag’.

  • include_empty_annotations (bool, optional) – If False only annotations with an overlapping annotation in the second collection get included. Defaults to True.

  • distance (str, optional) – The IAA distance function. Either ‘binary’ or ‘interval’. See the NLTK API for further information. Defaults to ‘binary’.
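
The difference between the two distance options can be shown with plain Python re-implementations. They follow the definitions of the equally named functions in NLTK's nltk.metrics.distance module as far as we know them; how gitma passes them on is not shown here.

```python
def binary_distance(label1, label2) -> float:
    """0.0 for identical labels, 1.0 for any two different labels."""
    return 0.0 if label1 == label2 else 1.0


def interval_distance(label1, label2):
    """Squared difference; only meaningful for numeric labels."""
    return (label1 - label2) ** 2


# Categorical labels such as tags: every disagreement weighs the same.
print(binary_distance('hero', 'hero'))
print(binary_distance('hero', 'villain'))

# Numeric labels such as a rating property: larger gaps weigh more.
print(interval_distance(1, 2))
print(interval_distance(1, 3))
```

‘binary’ therefore suits tag names and other categorical values, while ‘interval’ only makes sense when the compared values are numbers.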

gamma_agreement(annotation_collections, alpha=3, beta=1, delta_empty=0.01, n_samples=30, precision_level=0.01)#

pygamma_table(annotation_collections='all')#

Concatenates annotation collections to pygamma table.

Parameters

annotation_collections (Union[str, list], optional) – List of annotation collections. Defaults to ‘all’.

Returns

Concatenated annotation collections as pd.DataFrame in pygamma format.

Return type

pd.DataFrame

Examples#

Load a local CATMA project#

To load a CATMA project that you have already cloned, only the project’s name and its location are required:

import gitma

project = gitma.CatmaProject(
   project_name='DemoProject',
   projects_directory='../user_projects/'
)

With the parameters included_acs, excluded_acs and ac_filter_keyword you can select which annotation collections get loaded. Assuming you have a project with three annotation collections named ‘ac1’, ‘ac2’ and ‘AC3’, you can select the first two annotation collections with any of these options:

# option 1
project = gitma.CatmaProject(
   project_name='DemoProject',
   projects_directory='../user_projects/',
   included_acs=['ac1', 'ac2']
)

# option 2
project = gitma.CatmaProject(
   project_name='DemoProject',
   projects_directory='../user_projects/',
   excluded_acs=['AC3']
)

# option 3
project = gitma.CatmaProject(
   project_name='DemoProject',
   projects_directory='../user_projects/',
   ac_filter_keyword='ac'
)
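
Why all three options select the same collections can be sketched in plain Python. The helper select_collections is hypothetical and not part of gitma; it assumes that ac_filter_keyword is matched as a case-sensitive substring of the collection name, which is consistent with ‘AC3’ being left out in option 3.

```python
from typing import List, Optional


def select_collections(
    names: List[str],
    included_acs: Optional[List[str]] = None,
    excluded_acs: Optional[List[str]] = None,
    ac_filter_keyword: Optional[str] = None,
) -> List[str]:
    """Return the collection names that pass all given filters."""
    selected = []
    for name in names:
        if included_acs is not None and name not in included_acs:
            continue
        if excluded_acs is not None and name in excluded_acs:
            continue
        # Assumption: case-sensitive substring match.
        if ac_filter_keyword is not None and ac_filter_keyword not in name:
            continue
        selected.append(name)
    return selected


names = ['ac1', 'ac2', 'AC3']

print(select_collections(names, included_acs=['ac1', 'ac2']))  # option 1
print(select_collections(names, excluded_acs=['AC3']))         # option 2
print(select_collections(names, ac_filter_keyword='ac'))       # option 3
print(select_collections(names))                               # no filter: everything
```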

Load a remote CATMA project#

If you load your project from the CATMA GitLab, three further parameters are required:

project = gitma.CatmaProject(
   project_name='DemoProject',
   load_from_gitlab=True,
   gitlab_access_token='<your_access_token>',
   backup_directory='../user_projects/'
)

When you load a remote CATMA project, it is cloned into the backup_directory. Once a CATMA project has been cloned into this directory, you have to load it as a local project, as demonstrated above.

Plot a cooccurrence network for the annotations in your project#

You can plot cooccurrent annotations:

project.cooccurrence_network()

You can customize your network with the following parameters:

project.cooccurrence_network(
   annotation_collections=[      # define the included annotation collections
      '<your_first_annotation_collection>',
      '<your_second_annotation_collection>'
   ],
   level='prop:<your_property>', # set a property as level
   character_distance=50,        # define which distance is considered cooccurrent
   included_tags=None,           # define a list with tags included
   excluded_tags=None,           # define a list with tags excluded
   save_as_gexf='my_gephi_file'  # save your network as Gephi file
)