Argument Differentiation. Soft constraints and data-driven models

Øvrelid, Lilja

Abstract

The ability to distinguish between different types of arguments is central to syntactic analysis, whether studied from a theoretical or computational point of view. This thesis investigates the influence and interaction of linguistic properties of syntactic arguments in argument differentiation. Cross-linguistic generalizations regarding these properties often express probabilistic, or soft, constraints, rather than absolute requirements on syntactic structure. In language data, we observe frequency effects in the realization of syntactic arguments. We propose that argument differentiation can be studied using data-driven methods which directly express the relationship between frequency distributions in language data and linguistic categories. The main focus of this thesis is on the formulation and empirical evaluation of linguistically motivated features for data-driven modeling. Based on differential properties of syntactic arguments in Scandinavian language data, we investigate the linguistic factors involved in argument differentiation from two different perspectives. We study automatic acquisition of the lexical semantic category of animacy and show that statistical tendencies in argument differentiation support automatic classification of unseen nouns. The classification is furthermore robust, generalizable across machine learning algorithms, and scalable to larger data sets. We go on to perform a detailed study of the influence of a range of different linguistic properties, such as animacy, definiteness and finiteness, on argument disambiguation in data-driven dependency parsing of Swedish. By including features capturing these properties in the representations used by the parser, we are able to improve accuracy significantly, particularly for the analysis of syntactic arguments.
The thesis shows how the study of soft constraints and gradience in language can be carried out using data-driven models and argues that these provide a controlled setting where different factors may be evaluated and their influence quantified. By focusing on empirical evaluation, we come to a better understanding of the results and implications of the data-driven models and furthermore show how linguistic motivation in turn can lead to improved computational models.

KEY WORDS: Syntactic arguments, parsing, lexical acquisition, animacy, Scandinavian syntax, soft constraints, data-driven models, machine learning

DISTRIBUTION: Dept. of Swedish Language, University of Gothenburg, Box 200, 405 30 Gothenburg, Sweden

ISSN: 0347-948X

ISBN: 978-91-87850-35-6

COVER ILLUSTRATION: How to describe the world is still an open question, by Randi Nygård

TYPESET in LaTeX2ε by Lilja Øvrelid

PRINTED in Sweden by Intellecta Docusys, Mölndal, 2008

University

Göteborgs universitet/University of Gothenburg

Institution

Department of Swedish

Institutionen för svenska språket

Public defence

Lilla hörsalen, Humanisten, at 10:15

Date of defence

2008-05-31

URI

http://hdl.handle.net/2077/17287

View/Open

Press release (3.217Kb)

Thesis (3.127Mb)

Date

2008

Author

Øvrelid, Lilja

Publication type

Doctoral thesis

ISBN

978-91-87850-35-6

ISSN

0347-948X

Series/Report no.

Data linguistica
