AnnotationCollection#

class AnnotationCollection(ac_uuid, catma_project, context=50)#

Bases: object

Class which represents a CATMA annotation collection.

Parameters

ac_uuid (str) – The annotation collection’s UUID
catma_project (CatmaProject) – The parent CatmaProject
context (int, optional) – The text span to be considered for the annotation context. Defaults to 50.

Raises

FileNotFoundError – If the path of the annotation collection’s header.json does not exist.

uuid: str#: The annotation collection’s UUID.

projects_directory: str#: The directory where the parent project is located.

project_uuid: str#: The parent project’s UUID.

directory: str#: The annotation collection’s directory.

name: str#: The annotation collection’s name.

plain_text_id: str#: The UUID of the annotation collection’s document.

text: gitma.text.Text#: The document of the annotation collection as a gitma.Text object.

text_version: str#: The document’s version.

tags: List[gitma.tag.Tag]#: Tags found in the annotation collection as a list of gitma.Tag objects.

annotations: list#: List of annotations in the annotation collection as gitma.Annotation objects.

df: pandas.core.frame.DataFrame#: Annotations as a pandas.DataFrame.

to_list(tags=None)#

Returns list of annotations as dictionaries using the Annotation.to_dict() method.

Parameters: tags (Union[list, None]) – Tags included in the annotations list. If None all tags are included. Defaults to None.
Returns: List of annotations as dictionaries.
Return type: List[dict]

annotation_dict()#

Creates dictionary with UUIDs as keys and Annotation objects as values.

Returns: Dictionary with UUIDs as keys and Annotation objects as values.
Return type: Dict[str, Annotation]

duplicate_by_prop(prop)#

Duplicates the rows in the annotation collection’s DataFrame if the given property has multiple property values.

Parameters: prop (str) – A property used in the annotation collection.
Raises: ValueError – If the property has not been used in the annotation collection.
Returns: A duplicate of the annotation collection’s DataFrame.
Return type: pd.DataFrame
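
To illustrate what duplicating rows by property means, the following sketch uses plain dictionaries instead of a DataFrame. The row layout, the ‘prop:’ column prefix and the helper duplicate_rows_by_prop are assumptions made for illustration; this is not gitma’s implementation.

```python
from typing import Dict, List


def duplicate_rows_by_prop(rows: List[Dict], prop: str) -> List[Dict]:
    """Return one row per property value (hypothetical helper, not part of gitma)."""
    key = f"prop:{prop}"  # assumed column naming
    if not any(key in row for row in rows):
        # Mirrors the documented ValueError for properties that are not used.
        raise ValueError(f"Property '{prop}' is not used in these rows.")
    result = []
    for row in rows:
        # Rows without any value are kept once, with None as the value.
        values = row.get(key) or [None]
        for value in values:
            result.append({**row, key: value})
    return result


rows = [
    {"tag": "stative_event", "prop:characters": ["marquis", "marquise"]},
    {"tag": "non_event", "prop:characters": []},
]
duplicated = duplicate_rows_by_prop(rows, "characters")
```

In this sketch the first row appears twice in the result, once per property value, which makes counting or plotting by single values straightforward.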

push_annotations(commit_message='new annotations')#

Runs git add ., git commit and git push for a single annotation collection.

Note: Works only if git is installed and the CATMA access token is stored in the git credential manager.

Parameters: commit_message (str, optional) – Customize the commit message. Defaults to ‘new annotations’.

plot_annotations(y_axis='tag', color_prop=None)#

Creates an interactive Plotly Scatter Plot to explore this annotation collection.

Parameters

y_axis (str, optional) – The column in AnnotationCollection.df DataFrame used for the y-axis. Defaults to ‘tag’.
color_prop (str, optional) – A property’s name used in the annotation collection, prefixed with ‘prop:’. Defaults to None.

Returns

Plotly scatter plot.

Return type

go.Figure

filter_by_tag_path(path_element)#

Filters annotation collection data frame for annotations with the given path_element in the tag’s full path.

Parameters: path_element (str) – Any tag name within the used tagsets.
Returns: Data frame in the format of the annotation collection data frames.
Return type: pd.DataFrame

plot_scaled_annotations(tag_scale=None, bin_size=50, smoothing_window=100)#

Plots a graph with scaled annotations. This function is still under development.

Parameters

tag_scale (dict, optional) – description. Defaults to None.
bin_size (int, optional) – description. Defaults to 50.
smoothing_window (int, optional) – description. Defaults to 100.

Raises

Exception – description

cooccurrence_network(character_distance=100, included_tags=None, excluded_tags=None, level='tag', plot_stats=False, save_as_gexf=False)#

Draws a co-occurrence network graph where every tag is a node and every edge represents two co-occurrent tags. The character_distance parameter controls when two annotations are considered co-occurrent. If you set character_distance=0, only the tags of overlapping annotations are represented as connected nodes.

Parameters

character_distance (int, optional) – The character distance within which annotations are considered co-occurrent. Defaults to 100.
included_tags (list, optional) – List of included tags. Defaults to None.
excluded_tags (list, optional) – List of excluded tags. Defaults to None.
level (str, optional) – Select ‘tag’ or any property in your annotation collections with the prefix ‘prop:’. Defaults to ‘tag’.
plot_stats (bool, optional) – Whether to return network stats. Defaults to False.
save_as_gexf (bool, optional) – If given any string as filename, the network is saved as a Gephi (GEXF) file. Defaults to False.
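
The co-occurrence criterion can be sketched with plain tuples. This is a minimal illustration under stated assumptions, not gitma’s implementation: annotations are reduced to (tag, start_point, end_point) tuples, end points are exclusive, pairs with identical tags are skipped, and the helper name cooccurrent_tag_pairs is hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple

# (tag, start_point, end_point): a simplified stand-in for annotation objects.
Span = Tuple[str, int, int]


def cooccurrent_tag_pairs(spans: List[Span], character_distance: int = 100) -> Set[Tuple[str, ...]]:
    """Collect tag pairs whose annotations overlap or lie within character_distance."""
    edges = set()
    for (tag_a, start_a, end_a), (tag_b, start_b, end_b) in combinations(spans, 2):
        if tag_a == tag_b:
            continue  # assumption: no self-loops in this sketch
        # A negative gap means the spans overlap; otherwise it is the number of
        # characters between the end of one span and the start of the other.
        gap = max(start_a, start_b) - min(end_a, end_b)
        if gap < 0 or (character_distance > 0 and gap <= character_distance):
            edges.add(tuple(sorted((tag_a, tag_b))))
    return edges


spans = [
    ("stative_event", 0, 72),
    ("process_event", 60, 110),
    ("non_event", 150, 181),
]
overlapping_only = cooccurrent_tag_pairs(spans, character_distance=0)
within_100 = cooccurrent_tag_pairs(spans, character_distance=100)
```

With character_distance=0 only the two overlapping annotations are connected; with the default of 100 all three tags are connected in this example.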

to_pygamma_table()#

Returns the annotation collection’s DataFrame in the format pygamma takes as input.

Returns: DataFrame with four columns: ‘annotator’, ‘tag’, ‘start_point’ and ‘end_point’.
Return type: pd.DataFrame

tag_stats(tag_col='tag', stopwords=None, ranking=10)#

Computes the following data for each tag in the annotation collection:

the count of annotations with a tag
the complete text span annotated with a tag
the average text span annotated with a tag
the n most frequent tokens in the text span annotated with a tag

Parameters

tag_col (str, optional) – Whether the data gets computed for the tag, a property or the annotators. Defaults to ‘tag’.
stopwords (list, optional) – A list of stopword tokens. Defaults to None.
ranking (int, optional) – The number of most frequent tokens to be included. Defaults to 10.

Returns

The data as pandas DataFrame.

Return type

pd.DataFrame
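
The four values listed above can be sketched with the standard library. This is an illustration only: it assumes that text spans are measured in characters and that tokens are lower-cased and split on whitespace, and the helper simple_tag_stats and its output keys are hypothetical rather than gitma’s column names.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


def simple_tag_stats(
    annotations: List[Tuple[str, str]],
    stopwords: Optional[List[str]] = None,
    ranking: int = 10,
) -> Dict[str, dict]:
    """Compute per-tag counts, span lengths and most frequent tokens."""
    stop = set(stopwords or [])
    segments_by_tag = defaultdict(list)
    for tag, text in annotations:
        segments_by_tag[tag].append(text)

    stats = {}
    for tag, segments in segments_by_tag.items():
        total_span = sum(len(segment) for segment in segments)
        tokens = [
            token
            for segment in segments
            for token in segment.lower().split()
            if token not in stop
        ]
        stats[tag] = {
            "annotations": len(segments),
            "text_span": total_span,
            "text_span_mean": total_span / len(segments),
            "most_frequent_tokens": Counter(tokens).most_common(ranking),
        }
    return stats


annotations = [
    ("process_event", "das man jetzt in Schutt und Trümmern liegen sieht"),
    ("non_event", "wenn man vom St. Gotthard kommt"),
    ("non_event", "wenn man kommt"),
]
stats = simple_tag_stats(annotations, stopwords=["man"], ranking=2)
```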

property_stats()#

Counts the property values for each property.

Returns: DataFrame with properties as index and property values as header.
Return type: pd.DataFrame

get_annotation_by_tag(tag_name)#

Creates a list of all annotations with a given tag name.

Parameters: tag_name (str) – The searched tag’s name.
Returns: List of annotations as gitma.Annotation objects.
Return type: List[Annotation]

annotate_properties(tag, prop, value)#

Sets the value of the given property. This function uses the gitma.Annotation.set_property_values() method.

Parameters

tag (str) – The parent tag of the property.
prop (str) – The property to be annotated.
value (list) – The new property value.

rename_property_value(tag, prop, old_value, new_value)#

Renames property value of all annotations with the given tag name. Replaces only the property value defined by the parameter old_value.

Parameters

tag (str) – The tag’s name.
prop (str) – The property’s name.
old_value (str) – The old property value that will be replaced.
new_value (str) – The new property value that will replace the old property value.
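
The replacement rule, in which only old_value is replaced and other values of the same property are kept, can be sketched on plain dictionaries. The dictionary layout and the helper rename_value are assumptions for illustration; gitma itself works on the annotation JSON files.

```python
from typing import Dict, List


def rename_value(annotations: List[Dict], tag: str, prop: str, old_value: str, new_value: str) -> int:
    """Replace old_value with new_value in place; return the number of changed annotations."""
    changed = 0
    for annotation in annotations:
        values = annotation["properties"].get(prop)
        # Skip other tags and annotations that do not carry the old value.
        if annotation["tag"] != tag or not values or old_value not in values:
            continue
        annotation["properties"][prop] = [
            new_value if value == old_value else value for value in values
        ]
        changed += 1
    return changed


annotations = [
    {"tag": "process_event", "properties": {"intentional": ["yes"]}},
    {"tag": "stative_event", "properties": {"characters": ["marquis", "marquise"]}},
    {"tag": "stative_event", "properties": {"characters": ["marquise"]}},
]
changed = rename_value(annotations, "stative_event", "characters", "marquis", "marchese")
```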

delete_properties(tag, prop)#

Deletes a property from all annotations with a given tag name.

Parameters

tag (str) – The annotations’ tag name.
prop (str) – The name of the property that will be removed.

to_stanford_tsv(tags='all', file_name='tsv_annotation_export', spacy_model='de_core_news_sm')#

Writes a TSV-file for this annotation collection which can be used to train a Stanford NER model. Every token in the associated text gets a tag if it lies within an annotated text segment.

Parameters

tags (Union[list, str], optional) – List of tags that should be considered. If set to ‘all’, all annotations are included. Defaults to ‘all’.
file_name (str, optional) – Name of the TSV-file. Defaults to ‘tsv_annotation_export’.
spacy_model (str, optional) – A spaCy model as listed at https://spacy.io/usage/models. Defaults to ‘de_core_news_sm’. Note that the specified model first needs to be installed, as detailed on the linked page.
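
The token tagging rule can be sketched without spaCy. The sketch below swaps in a simple regular-expression tokenizer, labels tokens outside every annotation with ‘O’ (the usual convention in Stanford NER training files, assumed here) and uses a hypothetical tag named ‘place’; the file gitma actually writes may differ in detail.

```python
import re
from typing import List, Tuple


def stanford_tsv_lines(text: str, annotations: List[Tuple[str, int, int]]) -> List[str]:
    """Return one 'token<TAB>label' line per token of the text."""
    lines = []
    # Words and single punctuation marks are treated as separate tokens.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        label = "O"  # assumed label for tokens outside every annotation
        for tag, start, end in annotations:
            # A token is tagged if it lies completely within an annotated segment.
            if start <= match.start() and match.end() <= end:
                label = tag
                break
        lines.append(f"{match.group()}\t{label}")
    return lines


text = "Am Fuße der Alpen, bei Locarno"
annotations = [("place", 12, 17), ("place", 23, 30)]
lines = stanford_tsv_lines(text, annotations)
```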

write_annotation_csv(tags='all', property='all', only_missing_prop_values=False, filename='PropertyAnnotationTable')#

Creates a CSV file that can be used to add property values to existing annotations. The added property values can be imported with the read_annotation_csv() method.

See the example below.

Parameters

tags (Union[str, list], optional) – List of tag names to be included. If set to ‘all’ all annotations will be written into the csv file. Defaults to ‘all’.
property (str, optional) – The property to be included. If set to ‘all’, all properties will be written into the CSV file. Defaults to ‘all’.
only_missing_prop_values (bool, optional) – Whether only empty properties should be included. Defaults to False.
filename (str, optional) – The name of the output CSV file. Defaults to ‘PropertyAnnotationTable’.

read_annotation_csv(filename='PropertyAnnotationTable.csv', push_to_gitlab=False)#

Reads a CSV file created by the write_annotation_csv() method and updates the annotation JSON files. Additionally, if push_to_gitlab=True, the annotations are pushed to the CATMA GitLab backend.

See the example below.

Parameters

filename (str, optional) – The annotation csv file’s name/directory. Defaults to ‘PropertyAnnotationTable.csv’.
push_to_gitlab (bool, optional) – Whether to push the annotations to gitlab. Defaults to False.

Examples#

Add property values via csv table#

The AnnotationCollection class can be used to add property values to existing annotations using a csv table.

Step 1: Load your project and create a csv table to annotate the existing annotations/properties:

import gitma

project = gitma.CatmaProject(
   project_name='<your_project_name>',
   projects_directory='<your_projects_directory>'
)

project.ac_dict['<your_annotation_collection_name>'].write_annotation_csv(
   filename='PropertyAnnotationTable'  # default name
)

The method write_annotation_csv creates a csv table in this format:

id	annotation_collection	text	tag	property	values
CATMA_95259D62-E441-4009-AD8D-7F5124BF2323	bettelweib-event_type	Am Fuße der Alpen, bei Locarno im oberen Italien, befand sich ein Schloß	stative_event	characters	marquis,marquise
CATMA_AB9A223C-C6F7-495A-817F-ED57E1B69A70	bettelweib-event_type	das man jetzt in Schutt und Trümmern liegen sieht	process_event	intentional	yes
CATMA_AB9A223C-C6F7-495A-817F-ED57E1B69A70	bettelweib-event_type	das man jetzt in Schutt und Trümmern liegen sieht	process_event	characters
CATMA_2A4C8A4E-2842-44D2-B2E2-F9A6AE2B8063	bettelweib-event_type	wenn man vom St. Gotthard kommt	non_event	characters

Step 2: Add property values:

For every property per annotation a table row will be created.

Caution

In these tables only the values column is editable!

If you want to add multiple values for a property, separate the values with a comma. It is recommended to use a csv editor like https://edit-csv.net/. Finish your annotations by saving the csv file.
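
How a comma-separated values cell maps to a list of property values can be sketched as follows. The helper split_property_values is hypothetical, and stripping surrounding whitespace is an assumption; gitma’s own parsing may differ.

```python
from typing import List


def split_property_values(cell: str) -> List[str]:
    """Split the content of a values cell into a list of property values."""
    return [value.strip() for value in cell.split(",") if value.strip()]


multiple = split_property_values("marquis,marquise")
single = split_property_values("yes")
empty = split_property_values("")
```

An empty cell yields an empty list in this sketch, which corresponds to a property without values.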

Step 3: Load the added property values in your CATMA project

After you have finished the property annotations in the csv file, you can push the annotations to the CATMA GitLab backend.

project.ac_dict['<your_annotation_collection_name>'].read_annotation_csv(
   filename='PropertyAnnotationTable.csv',   # default name
   push_to_gitlab=True                       # default is False
)

Caution

The push to gitlab will only work if you have git installed and your CATMA access token is stored in the git credential manager.

Step 4 (optional): Import your annotation to CATMA

If push_to_gitlab=False, you can push the changed annotations to GitLab yourself. For this purpose the read_annotation_csv method prints the annotation collection’s directory. Open Git Bash or another terminal, change to the annotation collection’s directory and run:

git add .
git commit -m 'new property annotations'
git push origin HEAD:master
