Argumentation and agreement: Annotating and evaluating Swedish corpora for argumentation mining
Date
2025-09-12
Abstract
Argumentation occurs in all parts of life, and as such, is studied across disciplines. In natural language processing, the field of argumentation mining aims to develop computational tools that automatically analyze and evaluate argumentation. Such tools have many uses, from automatically grading essays to identifying fallacies. To build such tools, annotated data is essential both for training and for evaluation, especially with large language models (LLMs). Creating annotated datasets, however, presents significant challenges, not only because of the complexity of argumentation but also because of methodological questions such as how to represent argumentation and how to evaluate annotation quality.
To create more resources and to investigate these challenges, in this thesis I explore several approaches to argumentation annotation. To this end, I also present a comprehensive survey of argumentation annotation. Three annotation approaches of varying complexity are explored: argumentation schemes applied to editorials, argumentative spans applied to online forums and political debates, and attitude annotation applied to tweets. The datasets thus represent a wide variety of genres and approaches. Attitude annotation of tweets showed the highest agreement among annotators, while annotation of editorials with argumentation schemes proved the most challenging.
In the evaluation of the annotations, several types of disagreement were identified. Most saliently, disagreement often occurred in cases where multiple interpretations were possible, challenging agreement as the primary measure of quality. These findings show the need for more comprehensive evaluation approaches. I therefore demonstrate ways to evaluate annotations beyond a single agreement measure: agreement analysis from multiple angles, investigation of annotator patterns, and manual inspection of disagreements.
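As a rough illustration of what such multi-angle evaluation can look like in practice, the following Python sketch computes pairwise agreement, per-annotator label distributions, and a list of fully disputed items for manual inspection. The annotators, labels, and data are hypothetical and not drawn from the thesis.

```python
from itertools import combinations
from collections import Counter
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from three annotators on the same ten items.
annotations = {
    "A1": ["claim", "premise", "claim", "none", "premise", "claim", "none", "premise", "claim", "none"],
    "A2": ["claim", "premise", "premise", "none", "premise", "claim", "claim", "premise", "claim", "none"],
    "A3": ["claim", "claim", "premise", "none", "premise", "none", "none", "premise", "claim", "claim"],
}

# Angle 1: pairwise Cohen's kappa between annotators.
for a, b in combinations(annotations, 2):
    kappa = cohen_kappa_score(annotations[a], annotations[b])
    print(f"kappa({a}, {b}) = {kappa:.2f}")

# Angle 2: annotator patterns, e.g. how often each annotator uses each label.
for name, labels in annotations.items():
    print(name, Counter(labels))

# Angle 3: items where all annotators chose different labels,
# flagged for manual inspection of the disagreement.
items = list(zip(*annotations.values()))
disputed = [i for i, labels in enumerate(items) if len(set(labels)) == len(labels)]
print("Items needing manual inspection:", disputed)
```

Each angle answers a different question: the kappa scores summarize overall reliability, the label distributions reveal systematic annotator tendencies, and the disputed items point to passages where multiple interpretations may be legitimate rather than erroneous.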
To further explore argumentation annotation, I investigate how two different LLMs annotate argumentation compared to human annotators. I find that while the models exhibit annotation behavior similar to that of humans, with comparable agreement levels and disagreement patterns, the models agree more with each other than the human annotators do among themselves.
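One way to make this within-group comparison concrete is to average pairwise agreement separately within the human group and within the model group, as in the minimal sketch below. The annotator names and labels are invented for illustration only.

```python
from itertools import combinations
from statistics import mean
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels on the same six items from two humans and two models.
humans = {
    "H1": ["claim", "premise", "none", "claim", "premise", "none"],
    "H2": ["claim", "none", "none", "claim", "claim", "none"],
}
models = {
    "M1": ["claim", "premise", "none", "claim", "premise", "claim"],
    "M2": ["claim", "premise", "none", "claim", "premise", "none"],
}

def mean_pairwise_kappa(group):
    """Average Cohen's kappa over all annotator pairs within a group."""
    return mean(cohen_kappa_score(group[a], group[b]) for a, b in combinations(group, 2))

print("human-human agreement:", round(mean_pairwise_kappa(humans), 2))
print("model-model agreement:", round(mean_pairwise_kappa(models), 2))
```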
Keywords
natural language processing, argumentation, annotation, argumentation mining, annotation evaluation, large language models, machine learning