CatmaProject#
- class CatmaProject(project_name, projects_directory='./', included_acs=None, excluded_acs=None, ac_filter_keyword=None, load_from_gitlab=False, gitlab_access_token=None, backup_directory='./')#
Bases: object
Class that represents a CATMA project including all documents, tagsets and annotation collections.
You can either load the project from a local Git clone or you can load it directly from GitLab after generating an access token in the CATMA GUI. See the examples in the docs for details.
- Parameters
project_name (str) – The CATMA project name.
projects_directory (str, optional) – The directory where your CATMA projects are located. Defaults to ‘./’.
included_acs (list, optional) – All annotation collections that should get loaded. If annotation collections are neither included nor excluded all annotation collections get loaded. Defaults to None.
excluded_acs (list, optional) – Annotation collections that should not get loaded. Defaults to None.
ac_filter_keyword (str, optional) – Only annotation collections with the given keyword get loaded. Defaults to None.
load_from_gitlab (bool, optional) – Whether the CATMA project should be loaded directly from CATMA’s GitLab backend. Defaults to False.
gitlab_access_token (str, optional) – The private CATMA GitLab access token. Defaults to None.
backup_directory (str, optional) – The directory where your project clone should be located. Defaults to ‘./’.
- Raises
FileNotFoundError – If the local or remote CATMA project was not found.
- uuid: str#
The project’s UUID.
- projects_directory: str#
The directory where the project is located.
- name: str#
The project’s name.
- tagsets: List[gitma.tagset.Tagset]#
List of gitma.Tagset objects.
- tagset_dict: Dict[str, gitma.tagset.Tagset]#
Dictionary of the project’s tagsets with the UUIDs as keys and gitma.Tagset objects as values.
- texts: List[gitma.text.Text]#
List of the gitma.Text objects.
- text_dict: Dict[str, gitma.text.Text]#
Dictionary of the project’s texts with titles as keys and gitma.Text objects as values.
- annotation_collections: List[gitma.annotation_collection.AnnotationCollection]#
List of gitma.AnnotationCollection objects.
- ac_dict: Dict[str, gitma.annotation_collection.AnnotationCollection]#
Dictionary of the project’s annotation collections with names as keys and gitma.AnnotationCollection objects as values.
- to_json(annotation_collections='all', rename_dict=None, included_tags=None, directory='./')#
Saves all annotations as a single JSON file.
- Parameters
annotation_collections (Union[List[str], str], optional) – Parameter to define the exported annotation collections. Defaults to ‘all’.
rename_dict (Union[Dict[str, str], None], optional) – Dictionary to rename annotation collections. Defaults to None.
included_tags (Union[list, None]) – Tags included in the annotations list. If None, all tags are included. Defaults to None.
directory (str) – Backup directory. Defaults to ‘./’.
- update()#
Updates local git folder and reloads CatmaProject.
Warning: This method can only be used if you have Git installed.
- annotations()#
Generator that yields all annotations as gitma.annotation.Annotation objects.
- Yields
Annotation – gitma.annotation.Annotation
- all_tags()#
Generator that yields all tags as gitma.tag.Tag objects.
- Yields
Tag – gitma.tag.Tag
- stats()#
Shows some CATMA project stats.
- Returns
DataFrame with project’s stats sorted by the annotation collection names.
- Return type
pd.DataFrame
- write_annotation_json(text_title, annotation_collection_name, tagset_name, tag_name, start_points, end_points, property_annotations, author)#
Function to write a new annotation into this project.
- Parameters
text_title (str) – The text title.
annotation_collection_name (str) – The name of the target annotation collection.
tagset_name (str) – The tagset’s name.
tag_name (str) – The tag’s name.
start_points (list) – The start points of the annotation spans.
end_points (list) – The end points of the annotation spans.
property_annotations (dict) – A dictionary with property names mapped to value lists.
author (str) – The annotation’s author.
- create_gold_annotations(ac_1_name, ac_2_name, gold_ac_name, excluded_tags=None, min_overlap=1.0, same_tag=True, copy_property_values_if_equal=True, push_to_gitlab=False)#
Searches for matching annotations in two annotation collections of this project and copies all matches into a third annotation collection. By default, property values are copied when they are exactly the same for matching annotations.
- Parameters
ac_1_name (str) – The name of the first annotation collection.
ac_2_name (str) – The name of the second annotation collection.
gold_ac_name (str) – The name of the third annotation collection, into which gold annotations will be written.
excluded_tags (list, optional) – Annotations with these tags will not be included in the gold annotations. Defaults to None.
min_overlap (float, optional) – The minimal overlap to generate a gold annotation. Defaults to 1.0.
same_tag (bool, optional) – Whether both annotations have to use the same tag. Defaults to True.
copy_property_values_if_equal (bool, optional) – Whether property values should be copied when they are exactly the same for matching annotations. Defaults to True. If False or property values are not exactly the same, no property values are copied.
push_to_gitlab (bool, optional) – Whether the gold annotations should be uploaded to the CATMA GitLab backend. Defaults to False.
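The matching step described above can be pictured with a small, library-independent sketch. It is not GitMA's implementation: the `Span` class, the `overlap_ratio` and `find_gold_pairs` helpers, and in particular the definition of overlap (shared characters divided by the length of the longer span) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation: character offsets plus a tag name."""
    start: int
    end: int
    tag: str


def overlap_ratio(a: Span, b: Span) -> float:
    """Shared characters divided by the longer span's length (an assumed definition)."""
    shared = max(0, min(a.end, b.end) - max(a.start, b.start))
    longest = max(a.end - a.start, b.end - b.start)
    return shared / longest if longest else 0.0


def find_gold_pairs(
    ac_1: List[Span],
    ac_2: List[Span],
    min_overlap: float = 1.0,
    same_tag: bool = True,
) -> List[Tuple[Span, Span]]:
    """Return every pair of annotations that would count as a match."""
    pairs = []
    for a in ac_1:
        for b in ac_2:
            if same_tag and a.tag != b.tag:
                continue  # tags differ, so the pair is not a gold candidate
            if overlap_ratio(a, b) >= min_overlap:
                pairs.append((a, b))
    return pairs


ac_1 = [Span(0, 10, "person"), Span(20, 30, "place")]
ac_2 = [Span(0, 10, "person"), Span(20, 28, "place"), Span(40, 50, "person")]

strict = find_gold_pairs(ac_1, ac_2)                    # identical spans only
lenient = find_gold_pairs(ac_1, ac_2, min_overlap=0.8)  # also accepts 20-30 vs. 20-28
print(len(strict), len(lenient))
```

With the default `min_overlap=1.0` only the identical `person` spans match; lowering the threshold to 0.8 also accepts the two `place` spans, which share 8 of 10 characters.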
- merge_annotations()#
Concatenates all annotation collections to one pandas data frame and resets index.
- Returns
Data frame including all annotations in the CATMA project.
- Return type
pd.DataFrame
- merge_annotations_per_document()#
Merges all annotations per document to one annotation collection.
- Returns
Dictionary with document titles as keys and annotations per document as pandas data frame.
- Return type
Dict[str, pd.DataFrame]
- plot_annotation_progression()#
Plot the annotation progression for every annotator in a CATMA project.
- Returns
Plotly scatter plot.
- Return type
go.Figure
- plot_interactive(color_col='annotation collection')#
This function generates one Plotly scatter plot per annotated document in a CATMA project. By default, the colors represent the annotation collections, so individual collections can be deactivated with the interactive legend.
- Parameters
color_col (str, optional) – ‘annotation collection’, ‘annotator’, ‘tag’ or any property with the prefix ‘prop:’. Defaults to ‘annotation collection’.
- Returns
Plotly scatter plot.
- Return type
go.Figure
- plot_annotations(color_col='annotation collection')#
This function generates one Plotly scatter plot per annotated document in a CATMA project. By default, the colors represent the annotation collections, so individual collections can be deactivated with the interactive legend.
- Parameters
color_col (str, optional) – ‘annotation collection’, ‘annotator’, ‘tag’ or any property with the prefix ‘prop:’. Defaults to ‘annotation collection’.
- Returns
Plotly scatter plot.
- Return type
go.Figure
- cooccurrence_network(annotation_collections='all', character_distance=100, included_tags=None, excluded_tags=None, level='tag', plot_stats=False, save_as_gexf=False)#
Draws co-occurrence network graph for annotations.
Every tag is represented by a node and every edge represents two co-occurrent tags. Use the character_distance parameter to define when two annotations are considered co-occurrent. If you set character_distance=0, only the tags of overlapping annotations will be represented as connected nodes.
See the examples in the docs for details about the usage.
- Parameters
annotation_collections (Union[str, List[str]]) – List with the names of the included annotation collections. If set to ‘all’ all annotation collections are included. Defaults to ‘all’.
character_distance (int, optional) – The distance in characters within which annotations are considered co-occurrent. Defaults to 100.
included_tags (list, optional) – List of included tags. Defaults to None.
excluded_tags (list, optional) – List of excluded tags. Defaults to None.
level (str, optional) – ‘tag’ or any property name with ‘prop:’ as prefix. Defaults to ‘tag’.
plot_stats (bool, optional) – Whether to return network stats. Defaults to False.
save_as_gexf (Union[bool, str], optional) – If given any string, the network gets saved as a Gephi file with the string as filename. Defaults to False.
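To make the role of character_distance concrete, here is a minimal sketch of how co-occurrence edges could be counted from plain character offsets. It does not reproduce GitMA's internals: the `cooccurrence_edges` helper, the tuple layout, the decision to skip pairs that share a tag, and the exact boundary handling (a gap strictly smaller than the distance) are assumptions chosen so that a distance of 0 keeps only overlapping annotations.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence, Tuple

# Hypothetical annotation layout: (start offset, end offset, tag name)
Annotation = Tuple[int, int, str]


def cooccurrence_edges(
    annotations: Sequence[Annotation],
    character_distance: int = 100,
) -> Counter:
    """Count tag pairs whose annotations lie within character_distance."""
    edges: Counter = Counter()
    for (s1, e1, t1), (s2, e2, t2) in combinations(annotations, 2):
        if t1 == t2:
            continue  # simplification: no self-loops between identical tags
        # Negative gap: the spans overlap. Positive gap: characters between them.
        gap = max(s1, s2) - min(e1, e2)
        if gap < character_distance:
            edges[tuple(sorted((t1, t2)))] += 1
    return edges


demo = [
    (0, 10, "character"),
    (5, 20, "emotion"),
    (60, 70, "place"),
    (400, 410, "emotion"),
]

print(cooccurrence_edges(demo, character_distance=100))
print(cooccurrence_edges(demo, character_distance=0))
```

With a distance of 100 the first three annotations are pairwise connected, while the far-away fourth one is not; with a distance of 0 only the overlapping `character` and `emotion` annotations remain connected.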
- disagreement_network(annotation_collections='all', included_tags=None, excluded_tags=None, level='tag', plot_stats=False, save_as_gexf=False)#
Draws disagreement network.
Every edge in the network represents two overlapping annotations from different annotation collections and with different tags or property values.
- Parameters
annotation_collections (Union[str, List[str]], optional) – List with the names of the included annotation collections. If set to ‘all’ all annotation collections are included. Defaults to ‘all’.
included_tags (list, optional) – List of included tags. Defaults to None.
excluded_tags (list, optional) – List of excluded tags. Defaults to None.
level (str, optional) – ‘tag’ or any property name with ‘prop:’ as prefix. Defaults to ‘tag’.
plot_stats (bool, optional) – Whether to return network stats. Defaults to False.
save_as_gexf (Union[bool, str], optional) – If given any string, the network gets saved as a Gephi file with the string as filename. Defaults to False.
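The edge definition above (overlapping annotations, different annotation collections, different tags) can be sketched without the library. The `disagreement_edges` helper and the tuple layout are hypothetical, and only the tag level is covered here, not property values.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical annotation layout: (collection name, start offset, end offset, tag name)
Annotation = Tuple[str, int, int, str]


def disagreement_edges(annotations: List[Annotation]) -> Counter:
    """Count tag pairs that different collections assigned to overlapping spans."""
    edges: Counter = Counter()
    for i, (ac_a, start_a, end_a, tag_a) in enumerate(annotations):
        for ac_b, start_b, end_b, tag_b in annotations[i + 1:]:
            same_collection = ac_a == ac_b
            overlapping = start_a < end_b and start_b < end_a
            if same_collection or not overlapping or tag_a == tag_b:
                continue  # not a disagreement in the sense described above
            edges[tuple(sorted((tag_a, tag_b)))] += 1
    return edges


demo = [
    ("ac1", 0, 10, "character"),
    ("ac2", 0, 10, "narrator"),   # same span, other collection, other tag
    ("ac1", 20, 30, "place"),
    ("ac2", 22, 30, "place"),     # overlapping, but the tags agree
    ("ac1", 40, 50, "emotion"),
    ("ac1", 45, 55, "place"),     # overlapping, but same collection
]

print(disagreement_edges(demo))
```

Only the first two annotations produce an edge: the `place` pair agrees on the tag, and the last pair comes from a single collection.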
- compare_annotation_collections(annotation_collections, color_col='tag')#
Plots annotations of multiple annotation collections of the same texts as line plot.
- Parameters
annotation_collections (list) – A list of annotation collection names.
color_col (str, optional) – Either ‘tag’ or one property name with prefix ‘prop:’. Defaults to ‘tag’.
- Raises
ValueError – If one of the annotation collection’s names does not exist.
- Returns
Plotly Line Plot.
- Return type
go.Figure
- get_iaa(ac1_name_or_inst, ac2_name_or_inst, tag_filter=None, filter_both_ac=False, level='tag', include_empty_annotations=True, distance='binary', verbose=True, return_as_dict=False)#
This method is deprecated! See calculate_scotts_pi and calculate_cohens_kappa.
Computes Inter-Annotator-Agreement for two annotation collections. See the demo notebook for details.
- Parameters
ac1_name_or_inst (str) – The name or instance of the first annotation collection, whose annotations form the basis of the computation.
ac2_name_or_inst (str) – The name or instance of the second annotation collection, whose annotations will be searched for matches to those in the first.
tag_filter (list, optional) – Which tags should be included. Defaults to None (all tags).
filter_both_ac (bool, optional) – Whether the tag filter should be applied to both annotation collections. Defaults to False (only applied to the first collection).
level (str, optional) – Whether the annotations’ tags or a specified property (prefixed with ‘prop:’) should be compared. Defaults to ‘tag’.
include_empty_annotations (bool, optional) – If False, only annotations with a matching annotation in the second collection are included. Defaults to True.
distance (str, optional) – The IAA distance function. Either ‘binary’ or ‘interval’. See the NLTK API for further information. Defaults to ‘binary’.
verbose (bool, optional) – Whether to print results to stdout. Defaults to True.
return_as_dict (bool, optional) – Whether the computed agreement scores should be returned as a dictionary in addition to being printed (assuming verbose=True). Defaults to False, in which case a Pandas DataFrame with a confusion matrix is returned instead.
- calculate_scotts_pi(ac1_name_or_inst, ac2_name_or_inst, tag_filter=None, filter_both_ac=False, level='tag', include_empty_annotations=True, distance='binary', verbose=True)#
Computes the Scott’s Pi inter-annotator-agreement (IAA) metric for two annotation collections. See the demo notebook for details.
- Parameters
ac1_name_or_inst (str) – The name or instance of the first annotation collection, whose annotations form the basis of the computation.
ac2_name_or_inst (str) – The name or instance of the second annotation collection, whose annotations will be searched for matches to those in the first.
tag_filter (list, optional) – Which tags should be included. Defaults to None (all tags).
filter_both_ac (bool, optional) – Whether the tag filter should be applied to both annotation collections. Defaults to False (only applied to the first collection).
level (str, optional) – Whether the annotations’ tags or a specified property (prefixed with ‘prop:’) should be compared. Defaults to ‘tag’.
include_empty_annotations (bool, optional) – If False, only annotations with a matching annotation in the second collection are included. Defaults to True.
distance (str, optional) – The IAA distance function. Either ‘binary’ or ‘interval’. See the NLTK API for further information. Defaults to ‘binary’.
verbose (bool, optional) – Whether to print results to stdout. Defaults to True.
- Returns
The metric score and a Pandas DataFrame with a confusion matrix.
- Return type
Tuple[Union[Float, Any], pd.DataFrame]
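As background for the returned score, the following sketch computes Scott's Pi by hand for two already aligned label sequences using the standard formula (observed agreement corrected by chance agreement from the pooled label distribution). It only covers the binary distance case; the `scotts_pi` function and the sample labels are illustrative and are not part of GitMA, which handles matching annotations and empty annotations itself.

```python
from collections import Counter
from typing import Sequence


def scotts_pi(labels_1: Sequence[str], labels_2: Sequence[str]) -> float:
    """Scott's Pi for two equally long label sequences (binary distance)."""
    if len(labels_1) != len(labels_2) or len(labels_1) == 0:
        raise ValueError("Both label sequences must be non-empty and equally long.")
    n = len(labels_1)

    # Share of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n

    # Chance agreement from the label distribution pooled over both annotators.
    pooled = Counter(labels_1) + Counter(labels_2)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())

    if expected == 1.0:
        return 1.0  # degenerate case: only one label was ever used
    return (observed - expected) / (1 - expected)


annotator_1 = ["character", "character", "place", "place"]
annotator_2 = ["character", "place", "place", "place"]

# Observed agreement is 3/4, expected agreement is 34/64, so Pi is 7/15.
print(round(scotts_pi(annotator_1, annotator_2), 3))
```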
- calculate_cohens_kappa(ac1_name_or_inst, ac2_name_or_inst, tag_filter=None, filter_both_ac=False, level='tag', include_empty_annotations=True, distance='binary', verbose=True)#
Computes the Cohen’s Kappa inter-annotator-agreement (IAA) metric for two annotation collections. See the demo notebook for details.
- Parameters
ac1_name_or_inst (str) – The name or instance of the first annotation collection, whose annotations form the basis of the computation.
ac2_name_or_inst (str) – The name or instance of the second annotation collection, whose annotations will be searched for matches to those in the first.
tag_filter (list, optional) – Which tags should be included. Defaults to None (all tags).
filter_both_ac (bool, optional) – Whether the tag filter should be applied to both annotation collections. Defaults to False (only applied to the first collection).
level (str, optional) – Whether the annotations’ tags or a specified property (prefixed with ‘prop:’) should be compared. Defaults to ‘tag’.
include_empty_annotations (bool, optional) – If False, only annotations with a matching annotation in the second collection are included. Defaults to True.
distance (str, optional) – The IAA distance function. Either ‘binary’ or ‘interval’. See the NLTK API for further information. Defaults to ‘binary’.
verbose (bool, optional) – Whether to print results to stdout. Defaults to True.
- Returns
The metric score and a Pandas DataFrame with a confusion matrix.
- Return type
Tuple[Union[Float, Any], pd.DataFrame]
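Cohen's Kappa estimates chance agreement from each annotator's own label distribution rather than from a pooled one. The sketch below applies the standard formula to two aligned label sequences with binary distance; the `cohens_kappa` function and the sample labels are assumptions for illustration, not GitMA code.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_1: Sequence[str], labels_2: Sequence[str]) -> float:
    """Cohen's Kappa for two equally long label sequences (binary distance)."""
    if len(labels_1) != len(labels_2) or len(labels_1) == 0:
        raise ValueError("Both label sequences must be non-empty and equally long.")
    n = len(labels_1)

    # Share of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n

    # Chance agreement: product of the two annotators' individual label shares.
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(
        (counts_1[label] / n) * (counts_2[label] / n)
        for label in counts_1.keys() | counts_2.keys()
    )

    if expected == 1.0:
        return 1.0  # degenerate case: both annotators used one identical label
    return (observed - expected) / (1 - expected)


annotator_1 = ["character", "character", "place", "place"]
annotator_2 = ["character", "place", "place", "place"]

# Observed agreement is 0.75 and expected agreement is 0.5, so Kappa is 0.5.
print(cohens_kappa(annotator_1, annotator_2))
```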
- gamma_agreement(annotation_collections, alpha=3, beta=1, delta_empty=0.01, n_samples=30, precision_level=0.01)#
- pygamma_table(annotation_collections='all')#
Concatenates annotation collections to pygamma table.
- Parameters
annotation_collections (Union[str, list], optional) – List of annotation collections. Defaults to ‘all’.
- Returns
Concatenated annotation collections as pd.DataFrame in pygamma format.
- Return type
pd.DataFrame
Examples#
Load a local CATMA project#
If you load a CATMA project that you have already cloned, only the project’s name and its location are required:
import gitma

project = gitma.CatmaProject(
    project_name='DemoProject',
    projects_directory='../user_projects/'
)
With the parameters included_acs, excluded_acs and ac_filter_keyword, you can select which annotation collections get loaded.
Assuming you have a project with three annotation collections named ‘ac1’, ‘ac2’ and ‘AC3’, you can select the first two annotation collections by any of these methods:
# option 1
project = gitma.CatmaProject(
    project_name='DemoProject',
    projects_directory='../user_projects/',
    included_acs=['ac1', 'ac2']
)
# option 2
project = gitma.CatmaProject(
    project_name='DemoProject',
    projects_directory='../user_projects/',
    excluded_acs=['AC3']
)
# option 3
project = gitma.CatmaProject(
    project_name='DemoProject',
    projects_directory='../user_projects/',
    ac_filter_keyword='ac'
)
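Why all three options select the same two collections can be checked with a small stand-alone sketch. The `select_collections` helper is hypothetical and does not mirror GitMA's code; it assumes that the three filters are combined and that the keyword filter is a case-sensitive substring test, which is consistent with ‘AC3’ being dropped by the keyword ‘ac’.

```python
from typing import Iterable, List, Optional


def select_collections(
    names: Iterable[str],
    included_acs: Optional[List[str]] = None,
    excluded_acs: Optional[List[str]] = None,
    ac_filter_keyword: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: return the collection names that would be loaded."""
    selected = []
    for name in names:
        if included_acs is not None and name not in included_acs:
            continue  # an include list is given and this name is not on it
        if excluded_acs is not None and name in excluded_acs:
            continue  # explicitly excluded
        # Assumption: the keyword must occur in the name, case-sensitively.
        if ac_filter_keyword is not None and ac_filter_keyword not in name:
            continue
        selected.append(name)
    return selected


all_names = ["ac1", "ac2", "AC3"]
option_1 = select_collections(all_names, included_acs=["ac1", "ac2"])
option_2 = select_collections(all_names, excluded_acs=["AC3"])
option_3 = select_collections(all_names, ac_filter_keyword="ac")
print(option_1, option_2, option_3)
```

Without any filter argument the helper returns all three names, matching the documented default that all annotation collections get loaded.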
Load a remote CATMA project#
If you load your project from the CATMA GitLab backend, three further parameters are required:
project = gitma.CatmaProject(
    project_name='DemoProject',
    load_from_gitlab=True,
    gitlab_access_token='<your_access_token>',
    backup_directory='../user_projects/'
)
When you load a remote CATMA project, it is cloned into the backup_directory.
Once a CATMA project has been cloned into this directory, you have to load it as a local project, as demonstrated above.
Plot a cooccurrence network for the annotations in your project#
You can plot co-occurrent annotations:
project.cooccurrence_network()
You can customize your network with the following parameters:
project.cooccurrence_network(
    annotation_collections=[            # define the included annotation collections
        '<your_first_annotation_collection>',
        '<your_second_annotation_collection>'
    ],
    level='prop:<your_property>',       # set a property as level
    character_distance=50,              # define which distance is considered co-occurrent
    included_tags=None,                 # define a list with tags included
    excluded_tags=None,                 # define a list with tags excluded
    save_as_gexf='my_gephi_file'        # save your network as Gephi file
)