AnnotationCollection#

class AnnotationCollection(ac_uuid, catma_project, context=50)#

Bases: object

Class which represents a CATMA annotation collection.

Parameters
  • ac_uuid (str) – The annotation collection’s UUID

  • catma_project (CatmaProject) – The parent CatmaProject

  • context (int, optional) – The text span to be considered for the annotation context. Defaults to 50.

Raises

FileNotFoundError – If the path of the annotation collection’s header.json does not exist.

uuid: str#

The annotation collection’s UUID.

projects_directory: str#

The directory where the parent project is located.

project_uuid: str#

The parent project’s UUID.

directory: str#

The annotation collection’s directory.

name: str#

The annotation collection’s name.

plain_text_id: str#

The UUID of the annotation collection’s document.

text: gitma.text.Text#

The document of the annotation collection as a gitma.Text object.

text_version: str#

The document’s version.

tags: List[gitma.tag.Tag]#

Tags found in the annotation collection as a list of gitma.Tag objects.

annotations: list#

List of annotations in the annotation collection as gitma.Annotation objects.

df: pandas.core.frame.DataFrame#

Annotations as a pandas.DataFrame.

to_list(tags=None)#

Returns list of annotations as dictionaries using the Annotation.to_dict() method.

Parameters

tags (Union[list, None]) – Tags included in the annotations list. If None all tags are included. Defaults to None.

Returns

List of annotations as dictionaries.

Return type

List[dict]

annotation_dict()#

Creates a dictionary with UUIDs as keys and Annotation objects as values.

Returns

Dictionary with UUIDs as keys and Annotation objects as values.

Return type

Dict[str, Annotation]

duplicate_by_prop(prop)#

Duplicates rows in the annotation collection’s DataFrame if the given property has multiple property values for the annotation represented by a row, so that each property value gets its own row.

Parameters

prop (str) – A property used in the annotation collection.

Raises

ValueError – If the property has not been used in the annotation collection.

Returns

A duplicate of the annotation collection’s DataFrame.

Return type

pd.DataFrame
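
The row duplication described above can be sketched with plain dictionaries instead of a DataFrame. This is only an illustration, not gitma's implementation: the row layout, the `characters` key and the choice to keep rows without values once are assumptions.

```python
from copy import deepcopy


def duplicate_rows_by_prop(rows, prop):
    """Return one row per property value (hypothetical helper)."""
    if not any(prop in row for row in rows):
        # Mirrors the documented ValueError for an unused property.
        raise ValueError(f"Property '{prop}' is not used in these rows.")
    duplicated = []
    for row in rows:
        # Assumption: rows without any value are kept once with None.
        values = row.get(prop) or [None]
        for value in values:
            new_row = deepcopy(row)
            new_row[prop] = value
            duplicated.append(new_row)
    return duplicated


rows = [
    {"tag": "stative_event", "characters": ["marquis", "marquise"]},
    {"tag": "non_event", "characters": []},
]
result = duplicate_rows_by_prop(rows, "characters")
```

In this sketch the first row appears twice in `result`, once per character, while the second row is kept once with an empty value.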

push_annotations(commit_message='new annotations')#

Runs git add ., git commit and git push for a single annotation collection.

Note: Works only if git is installed and the CATMA access token is stored in the git credential manager.

Parameters

commit_message (str, optional) – Customize the commit message. Defaults to ‘new annotations’.

plot_annotations(y_axis='tag', color_prop='tag')#

Creates an interactive Plotly scatter plot to explore an annotation collection.

Parameters
  • y_axis (str, optional) – The column in the AnnotationCollection DataFrame used for the y axis. Defaults to ‘tag’.

  • color_prop (str, optional) – A property’s name used in the AnnotationCollection. Defaults to ‘tag’.

Returns

Plotly scatter plot.

Return type

go.Figure

filter_by_tag_path(path_element)#

Filters the annotation collection’s data frame for annotations with the given path_element in the tag’s full path.

Parameters

path_element (str) – Any tag name within the used tagsets.

Returns

Data frame in the format of the annotation collection data frames.

Return type

pd.DataFrame
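
A minimal sketch of the filtering idea, using a list of dictionaries in place of the data frame. The `tag_path` key, the slash-separated path format and the matching on whole path elements are assumptions made for this illustration.

```python
def filter_rows_by_tag_path(rows, path_element):
    """Keep rows whose tag path contains path_element as one of its parts."""
    return [
        row for row in rows
        if path_element in row["tag_path"].strip("/").split("/")
    ]


rows = [
    {"tag": "process_event", "tag_path": "/event/process_event"},
    {"tag": "stative_event", "tag_path": "/event/stative_event"},
    {"tag": "non_event", "tag_path": "/non_event"},
]
events = filter_rows_by_tag_path(rows, "event")
```

Filtering by a parent tag such as the hypothetical `event` selects the annotations of all its child tags in one step.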

plot_scaled_annotations(tag_scale=None, bin_size=50, smoothing_window=100)#

Plots a graph with scaled annotations. This function is still under development.

Parameters
  • tag_scale (dict, optional) – description. Defaults to None.

  • bin_size (int, optional) – description. Defaults to 50.

  • smoothing_window (int, optional) – description. Defaults to 100.

Raises

Exception – description

cooccurrence_network(character_distance=100, included_tags=None, excluded_tags=None, level='tag', plot_stats=False, save_as_gexf=False)#

Draws a cooccurrence network graph in which every tag is a node and every edge connects two cooccurrent tags. The character_distance parameter defines when two annotations are considered cooccurrent. If you set character_distance=0, only the tags of overlapping annotations are represented as connected nodes.

Parameters
  • character_distance (int, optional) – The maximum distance in characters within which annotations are considered cooccurrent. Defaults to 100.

  • included_tags (list, optional) – List of included tags. Defaults to None.

  • level (str, optional) – Select ‘tag’ or any property in your annotation collections with the prefix ‘prop’. Defaults to ‘tag’.

  • excluded_tags (list, optional) – List of excluded tags. Defaults to None.

  • plot_stats (bool, optional) – Whether to return network stats. Defaults to False.

  • save_as_gexf (bool, optional) – If a filename string is given, the network is saved as a Gephi (GEXF) file. Defaults to False.
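
How character_distance decides which tags become connected can be sketched without any graph library. The helper names, the dictionary layout and the exact boundary handling (touching spans do not count as overlapping, pairs with the same tag are skipped) are assumptions of this sketch, not gitma's implementation.

```python
from collections import Counter
from itertools import combinations


def cooccurrent(a, b, character_distance):
    """True if the two annotations overlap or lie within character_distance."""
    first, second = sorted((a, b), key=lambda ann: ann["start"])
    if second["start"] < first["end"]:
        return True  # overlapping spans always count
    gap = second["start"] - first["end"]
    return character_distance > 0 and gap <= character_distance


def cooccurrence_edges(annotations, character_distance=100):
    """Count tag pairs (edges) over all cooccurrent annotation pairs."""
    edges = Counter()
    for a, b in combinations(annotations, 2):
        if a["tag"] != b["tag"] and cooccurrent(a, b, character_distance):
            edges[tuple(sorted((a["tag"], b["tag"])))] += 1
    return edges


annotations = [
    {"tag": "stative_event", "start": 0, "end": 70},
    {"tag": "process_event", "start": 60, "end": 110},
    {"tag": "non_event", "start": 150, "end": 180},
]
near = cooccurrence_edges(annotations, character_distance=100)
overlapping_only = cooccurrence_edges(annotations, character_distance=0)
```

With a distance of 100 all three tags are connected in this sketch, whereas a distance of 0 leaves only the edge between the two overlapping annotations.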

to_pygamma_table()#

Returns the annotation collection’s DataFrame in the format pygamma takes as input.

Returns

DataFrame with four columns: ‘annotator’, ‘tag’, ‘start_point’ and ‘end_point’.

Return type

pd.DataFrame

tag_stats(tag_col='tag', stopwords=None, ranking=10)#

Computes the following data for each tag in the annotation collection:

  • the count of annotations with a tag

  • the complete text span annotated with a tag

  • the average text span annotated with a tag

  • the n most frequent tokens in the text span annotated with a tag

Parameters
  • tag_col (str, optional) – Whether the data is computed for the tags, a property or the annotators. Defaults to ‘tag’.

  • stopwords (list, optional) – A list with stopword tokens. Defaults to None.

  • ranking (int, optional) – The number of most frequent tokens to be included. Defaults to 10.

Returns

The data as pandas DataFrame.

Return type

pd.DataFrame
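
The four values listed above can be sketched with the standard library. The dictionary keys, the whitespace tokenization and the lowercasing are assumptions of this illustration and may differ from what gitma actually computes.

```python
from collections import Counter, defaultdict


def tag_stats_sketch(annotations, stopwords=None, ranking=10):
    """Compute count, total span, mean span and top tokens per tag."""
    stopwords = set(stopwords or [])
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann["tag"]].append(ann)

    stats = {}
    for tag, anns in grouped.items():
        # The span of an annotation is its length in characters.
        spans = [ann["end"] - ann["start"] for ann in anns]
        tokens = Counter(
            token
            for ann in anns
            for token in ann["text"].lower().split()
            if token not in stopwords
        )
        stats[tag] = {
            "annotations": len(anns),
            "text_span": sum(spans),
            "text_span_mean": sum(spans) / len(spans),
            "most_frequent_tokens": [t for t, _ in tokens.most_common(ranking)],
        }
    return stats


annotations = [
    {"tag": "process_event", "start": 0, "end": 16, "text": "the castle burns"},
    {"tag": "non_event", "start": 17, "end": 29, "text": "if one comes"},
    {"tag": "process_event", "start": 30, "end": 46, "text": "the castle falls"},
]
stats = tag_stats_sketch(annotations, stopwords=["the"], ranking=1)
```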

property_stats()#

Counts the property values for each property.

Returns

DataFrame with properties as index and property values as header.

Return type

pd.DataFrame
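
As a rough illustration of the counting, the following sketch tallies property values per property with collections.Counter. The `properties` dictionary layout is an assumption; the real method returns a DataFrame rather than nested counters.

```python
from collections import Counter, defaultdict


def property_value_counts(annotations):
    """Map each property name to a Counter of its values."""
    counts = defaultdict(Counter)
    for ann in annotations:
        for prop, values in ann["properties"].items():
            counts[prop].update(values)
    return counts


annotations = [
    {"tag": "stative_event",
     "properties": {"characters": ["marquis", "marquise"]}},
    {"tag": "process_event",
     "properties": {"intentional": ["yes"], "characters": ["marquis"]}},
]
counts = property_value_counts(annotations)
```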

get_annotation_by_tag(tag_name)#

Creates a list of all annotations with a given tag name.

Parameters

tag_name (str) – The searched tag’s name.

Returns

List of annotations as gitma.Annotation objects.

Return type

List[Annotation]

annotate_properties(tag, prop, value)#

Sets the value for a given property. This function uses the gitma.Annotation.set_property_values() method.

Parameters
  • tag (str) – The parent tag of the property.

  • prop (str) – The property to be annotated.

  • value (list) – The new property value.

rename_property_value(tag, prop, old_value, new_value)#

Renames a property value in all annotations with the given tag name. Only the property value defined by the parameter old_value is replaced.

Parameters
  • tag (str) – The tag’s name.

  • prop (str) – The property’s name.

  • old_value (str) – The old property value that will be replaced.

  • new_value (str) – The new property value that will replace the old property value.
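
The replacement rule (only old_value changes, other values of the same property stay untouched) can be sketched as follows. The function name and the in-memory dictionary layout are hypothetical; the real method works on the annotation files.

```python
def rename_value(annotations, tag, prop, old_value, new_value):
    """Replace old_value with new_value for prop in annotations tagged tag."""
    renamed = 0
    for ann in annotations:
        if ann["tag"] != tag or prop not in ann["properties"]:
            continue
        values = ann["properties"][prop]
        if old_value in values:
            ann["properties"][prop] = [
                new_value if value == old_value else value for value in values
            ]
            renamed += 1
    return renamed


annotations = [
    {"tag": "stative_event",
     "properties": {"characters": ["marquis", "marquise"]}},
    {"tag": "non_event", "properties": {"characters": ["marquis"]}},
]
changed = rename_value(
    annotations, "stative_event", "characters", "marquis", "Marquis"
)
```

Only the annotation with the matching tag is modified here; the second annotation keeps its original value.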

delete_properties(tag, prop)#

Deletes a property from all annotations with a given tag name.

Parameters
  • tag (str) – The annotations’ tag name.

  • prop (str) – The name of the property that will be removed.

to_stanford_tsv(tags='all', file_name='tsv_annotation_export', spacy_model='de_core_news_sm')#

Takes a CATMA AnnotationCollection and writes a tsv file which can be used to train a Stanford NER model. Every token in the collection’s text gets a tag if it lies within an annotated text segment.

Parameters
  • tags (Union[list, str], optional) – List of tags that should be considered. If set to ‘all’, all annotations are included. Defaults to ‘all’.

  • file_name (str, optional) – Name of the tsv file. Defaults to ‘tsv_annotation_export’.

  • spacy_model (str, optional) – A spaCy model as listed in https://spacy.io/usage/models. Defaults to ‘de_core_news_sm’.
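
The token labelling can be sketched as below. To stay self-contained, this illustration splits on whitespace with a regular expression instead of using a spaCy model. It also assumes that a token must lie completely inside an annotation and that unannotated tokens get the label `O`, as is common in Stanford NER training files. The real export may differ in these details.

```python
import re


def stanford_tsv_lines(text, annotations):
    """Return 'token<TAB>label' lines; tokens outside annotations get 'O'."""
    lines = []
    for match in re.finditer(r"\S+", text):
        label = "O"
        for ann in annotations:
            # A token is labelled if it lies completely inside the span.
            if ann["start"] <= match.start() and match.end() <= ann["end"]:
                label = ann["tag"]
                break
        lines.append(f"{match.group()}\t{label}")
    return lines


text = "Am Fuße der Alpen bei Locarno"
annotations = [{"tag": "stative_event", "start": 3, "end": 17}]
lines = stanford_tsv_lines(text, annotations)
```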

write_annotation_csv(tags='all', property='all', only_missing_prop_values=False, filename='PropertyAnnotationTable')#

Creates a csv file to add property values to existing annotations. The added property values can be imported with the read_annotation_csv() method.

See the example below.

Parameters
  • tags (Union[str, list], optional) – List of tag names to be included. If set to ‘all’ all annotations will be written into the csv file. Defaults to ‘all’.

  • property (str, optional) – The property to be included. If set to ‘all’ all properties will be written into the csv file. Defaults to ‘all’.

  • only_missing_prop_values (bool, optional) – Whether only empty properties should be included. Defaults to False.

  • filename (str, optional) – The csv file name. Defaults to ‘PropertyAnnotationTable’.

read_annotation_csv(filename='PropertyAnnotationTable.csv', push_to_gitlab=False)#

Reads the csv file created by the write_annotation_csv() method and updates the annotation json files. Additionally, if push_to_gitlab=True, the annotations are pushed to the CATMA GitLab backend.

See the example below.

Parameters
  • filename (str, optional) – The annotation csv file’s name/directory. Defaults to ‘PropertyAnnotationTable.csv’.

  • push_to_gitlab (bool, optional) – Whether to push the annotations to gitlab. Defaults to False.

Examples#

Add property values via csv table#

The AnnotationCollection class can be used to add property values to existing annotations using a csv table.

Step 1: Load your project and create a csv table to annotate the existing annotations/properties:

import gitma

project = gitma.CatmaProject(
   project_name='<your_project_name>',
   projects_directory='<your_projects_directory>'
)

project.ac_dict['<your_annotation_collection_name>'].write_annotation_csv(
   filename='PropertyAnnotationTable'  # default name
)

The method write_annotation_csv creates a csv table in this format:

| id | annotation_collection | text | tag | property | values |
|---|---|---|---|---|---|
| CATMA_95259D62-E441-4009-AD8D-7F5124BF2323 | bettelweib-event_type | Am Fuße der Alpen, bei Locarno im oberen Italien, befand sich ein Schloß | stative_event | characters | marquis,marquise |
| CATMA_AB9A223C-C6F7-495A-817F-ED57E1B69A70 | bettelweib-event_type | das man jetzt in Schutt und Trümmern liegen sieht | process_event | intentional | yes |
| CATMA_AB9A223C-C6F7-495A-817F-ED57E1B69A70 | bettelweib-event_type | das man jetzt in Schutt und Trümmern liegen sieht | process_event | characters | |
| CATMA_2A4C8A4E-2842-44D2-B2E2-F9A6AE2B8063 | bettelweib-event_type | wenn man vom St. Gotthard kommt | non_event | characters | |

Step 2: Add property values:

A table row is created for every property of every annotation.

Caution

In these tables only the values column is editable!

If you want to add multiple values for a property, separate the values with a comma. It is recommended to use a csv editor like https://edit-csv.net/. Finish your annotations by saving the csv file.
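
To check an edited table before importing it, the comma-separated values column can be inspected with Python's csv module. This is a sketch under assumptions: it expects a comma-delimited file with the column names shown above and quoted cells, which may not match the exact dialect gitma writes. The helper name is hypothetical.

```python
import csv
import io


def read_property_values(csv_file):
    """Map (annotation id, property) to the list of comma-separated values."""
    values = {}
    for row in csv.DictReader(csv_file):
        cell = row["values"] or ""
        values[(row["id"], row["property"])] = [
            value.strip() for value in cell.split(",") if value.strip()
        ]
    return values


# In-memory stand-in for an edited PropertyAnnotationTable.csv.
table = io.StringIO(
    "id,annotation_collection,text,tag,property,values\n"
    "CATMA_95259D62-E441-4009-AD8D-7F5124BF2323,bettelweib-event_type,"
    '"Am Fuße der Alpen, bei Locarno",stative_event,characters,'
    '"marquis,marquise"\n'
    "CATMA_2A4C8A4E-2842-44D2-B2E2-F9A6AE2B8063,bettelweib-event_type,"
    "wenn man vom St. Gotthard kommt,non_event,characters,\n"
)
parsed = read_property_values(table)
```

Rows whose values cell is still empty map to an empty list, which makes it easy to spot properties that have not been annotated yet.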

Step 3: Load the added property values in your CATMA project

After you have finished the property annotations in the csv file, you can load them into your project and push them to the CATMA GitLab.

project.ac_dict['<your_annotation_collection_name>'].read_annotation_csv(
   filename='PropertyAnnotationTable.csv',   # default name
   push_to_gitlab=True                       # default is False
)

Caution

The push to gitlab will only work if you have git installed and your CATMA access token is stored in the git credential manager.

Step 4 (optional): Import your annotation to CATMA

If push_to_gitlab=False, you can push the changed annotations to GitLab yourself. For this purpose, the read_annotation_csv method prints the annotation collection’s directory. Open Git Bash or another terminal, change to the annotation collection’s directory and run:

git add .
git commit -m 'new property annotations'
git push origin HEAD:master
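
If you prefer to stay in Python, the same three commands can be scripted with the standard subprocess module. This is a minimal sketch with hypothetical helper names, not part of gitma. Like the manual route, it assumes that git is installed and that your CATMA access token is stored in the git credential manager.

```python
import subprocess


def git_commands(commit_message="new property annotations"):
    """Build the three git commands shown above as argument lists."""
    return [
        ["git", "add", "."],
        ["git", "commit", "-m", commit_message],
        ["git", "push", "origin", "HEAD:master"],
    ]


def push_collection(directory, commit_message="new property annotations"):
    """Run the git commands inside the annotation collection's directory."""
    for command in git_commands(commit_message):
        # check=True raises CalledProcessError at the first failing command.
        subprocess.run(command, cwd=directory, check=True)
```

Call push_collection('<your_annotation_collection_directory>') with the directory printed by read_annotation_csv.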