Data linguistica

Permanent URI for this collection: https://gupea-staging.ub.gu.se/handle/2077/18177

Department of Swedish Language, University of Gothenburg

This series is intended as a forum for works in natural language processing/language technology

Recent Submissions

Now showing 1 - 3 of 3
  • Item
    Resolving power of search keys in MedEval, a Swedish medical test collection with user groups: doctors and patients
    (2010-09-17) Friberg Heppin, Karin
    This thesis describes the making of a Swedish medical text collection, unique of its kind in making it possible to choose a user group: doctors or patients. The thesis also describes a series of pilot studies which demonstrate what kind of studies can be performed with such a collection. The pilot studies are focused on search key effectiveness: What makes a search key good, and what makes a search key bad? The need to bring linguistics and consideration of terminology into the information retrieval research field is demonstrated. Most information retrieval is about finding free text documents. Documents are built of terms, as are topics and search queries. It is important to understand the functions and features of these terms and not treat them like featureless objects. The thesis concludes that terms are not equal, but show very different behavior. The thesis addresses the problem of compounds, which, if used as search keys, will not match corresponding simplex words in the documents, while simplex words as search keys will not match corresponding compounds in the documents. The thesis discusses how compounds can be split to obtain more matches, without lowering the quality of a search. Another important aspect of the thesis is that it considers how different language registers, in this case those of doctors and patients, can be utilized to find documents written with one of the groups in mind. As the test collection contains a large set of documents marked for intended target group, doctors or patients, the language differences can be and are studied. The author suggests how to choose search keys when documents from one category or the other are desired. Information retrieval is a multi-disciplinary research field. It involves computer science, information science, and natural language processing.
There is a substantial amount of research behind the algorithms of modern search engines, but even with the best possible search algorithm the result of a search will not be successful without an effective query constructed with effective search keys.
  • Item
    Argument Differentiation. Soft constraints and data-driven models
    (2008) Øvrelid, Lilja
    The ability to distinguish between different types of arguments is central to syntactic analysis, whether studied from a theoretical or computational point of view. This thesis investigates the influence and interaction of linguistic properties of syntactic arguments in argument differentiation. Cross-linguistic generalizations regarding these properties often express probabilistic, or soft, constraints, rather than absolute requirements on syntactic structure. In language data, we observe frequency effects in the realization of syntactic arguments. We propose that argument differentiation can be studied using data-driven methods which directly express the relationship between frequency distributions in language data and linguistic categories. The main focus in this thesis is on the formulation and empirical evaluation of linguistically motivated features for data-driven modeling. Based on differential properties of syntactic arguments in Scandinavian language data, we investigate the linguistic factors involved in argument differentiation from two different perspectives. We study automatic acquisition of the lexical semantic category of animacy and show that statistical tendencies in argument differentiation support automatic classification of unseen nouns. The classification is furthermore robust, generalizable across machine learning algorithms, as well as scalable to larger data sets. We go on to perform a detailed study of the influence of a range of different linguistic properties, such as animacy, definiteness and finiteness, on argument disambiguation in data-driven dependency parsing of Swedish. By including features capturing these properties in the representations used by the parser, we are able to improve accuracy significantly, and in particular for the analysis of syntactic arguments.
The thesis shows how the study of soft constraints and gradience in language can be carried out using data-driven models and argues that these provide a controlled setting where different factors may be evaluated and their influence quantified. By focusing on empirical evaluation, we come to a better understanding of the results and implications of the data-driven models and furthermore show how linguistic motivation in turn can lead to improved computational models.
KEY WORDS: Syntactic arguments, parsing, lexical acquisition, animacy, Scandinavian syntax, soft constraints, data-driven models, machine learning
DISTRIBUTION: Dept. of Swedish Language, University of Gothenburg, Box 200, 405 30 Gothenburg, Sweden
ISSN: 0347-948X
ISBN: 978-91-87850-35-6
COVER ILLUSTRATION: How to describe the world is still an open question by Randi Nygård
TYPESET IN LATEX2ε by Lilja Øvrelid
PRINTED in Sweden by Intellecta Docusys, Mölndal, 2008