Unsupervised Learning of Morphology and the Languages of the World
Abstract
This thesis presents work in two areas: Language Technology and Linguistic
Typology.
In the field of Language Technology, a specific problem is addressed: Can a
computer extract a description of word conjugation in a natural language using
only written text in the language? The problem is often referred to as Unsupervised
Learning of Morphology and has a variety of applications, including
Machine Translation, Document Categorization and Information Retrieval. The
problem is also relevant for linguistic theory. We give a comprehensive survey
of work done so far on the problem and then describe a new approach to the
problem as well as a number of applications. The idea is that concatenative
affixation, i.e., how stems and affixes are strung together to form words, can,
with some success, be modelled simplistically. Essentially, words consist of
high-frequency strings (“affixes”) attached to low-frequency strings (“stems”),
as in the English play-ing. Case studies show how this naive model can be used
for stemming, language identification and bootstrapping language description.
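As a rough illustration of the frequency idea above, the following Python sketch scores each possible stem/suffix split of a word by how many distinct stems the suffix attaches to, multiplied by how many distinct suffixes the stem takes. This is not the algorithm developed in the thesis. The function names, the scoring rule and the `max_suffix_len` cutoff are assumptions made only for this example.

```python
from collections import defaultdict


def build_tables(words, max_suffix_len=4):
    """Record every stem/suffix split of every distinct word (hypothetical helper)."""
    stems_of = defaultdict(set)     # suffix -> stems it has been seen attached to
    suffixes_of = defaultdict(set)  # stem -> suffixes that have been seen on it
    for word in set(words):
        # Keep at least one character for the stem.
        for k in range(1, min(max_suffix_len, len(word) - 1) + 1):
            stem, suffix = word[:-k], word[-k:]
            stems_of[suffix].add(stem)
            suffixes_of[stem].add(suffix)
    return stems_of, suffixes_of


def split_word(word, stems_of, suffixes_of, max_suffix_len=4):
    """Return the (stem, suffix) split with the highest score.

    Assumed score: (number of stems taking the suffix) times
    (number of suffixes seen on the stem). Unseen words stay unsplit.
    """
    best, best_score = (word, ""), 0
    for k in range(1, min(max_suffix_len, len(word) - 1) + 1):
        stem, suffix = word[:-k], word[-k:]
        score = len(stems_of.get(suffix, ())) * len(suffixes_of.get(stem, ()))
        if score > best_score:
            best, best_score = (stem, suffix), score
    return best


words = ["playing", "played", "plays",
         "walking", "walked", "walks",
         "talking", "talked", "talks"]
stems_of, suffixes_of = build_tables(words)
print(split_word("playing", stems_of, suffixes_of))  # ('play', 'ing')
print(split_word("walked", stems_of, suffixes_of))   # ('walk', 'ed')
```

On this toy word list the frequent endings -ing, -ed and -s win because each attaches to three different stems. On real text a heuristic of this kind will also split words that carry no affix at all, which is one reason the abstract calls the model naive.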
There are around 7 000 languages in the world, exhibiting a bewildering
structural diversity. Linguistic Typology is the subfield of linguistics that aims
to understand this diversity. Many of the languages in the world today are
spoken only by relatively small groups of people and are threatened by extinction,
so recording them is a priority. Language documentation is, and
has been, an extremely decentralised activity, carried out not only by linguists
but also by missionaries, travellers, anthropologists and others, mostly over the
past 200 years. There is no central record of which and how many languages have
been described. To meet the priority, we have attempted to list those languages
which are the most poorly described and which do not belong to a language family
in which some other language is decently described, a task requiring both analysis
and diligence. Next, the thesis includes typological work on one of the more
tractable aspects of language structure, namely numeral systems, i.e., normed
expressions used to denote exact quantities. In one of the first surveys to cover
the whole world, we look at rare number bases among numeral systems. One
major rarity is base-6-36 systems, which are attested only in South/Southwest
New Guinea, and we make a special inquiry into their emergence.
Traditionally, linguists have had headaches over what counts as a language
as opposed to a dialect, and have therefore been reluctant to give counts of the
number of languages in a given area. One chapter of the present thesis shows
that, contrary to popular belief, there is an intuitively sound way to count
languages (as opposed to dialects). The only requirement is that, for each pair
of varieties, we are told whether they are mutually intelligible or not.
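The abstract states only the input to the count, a yes/no intelligibility judgement for each pair of varieties, and not the counting rule itself. The Python sketch below shows one simple rule that works from that input: varieties linked by a chain of mutual intelligibility are grouped together, and the groups (connected components) are counted. This rule is an assumption for illustration and may differ from the criterion defined in the thesis. The function name and the variety labels are hypothetical.

```python
def count_groups(varieties, intelligible_pairs):
    """Count connected components of the mutual-intelligibility relation.

    Assumption for this sketch: two varieties fall in the same group whenever
    a chain of mutually intelligible pairs links them.
    """
    parent = {v: v for v in varieties}

    def find(v):
        # Follow parent links to the group representative, compressing the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in intelligible_pairs:
        parent[find(a)] = find(b)  # merge the two groups

    return len({find(v) for v in varieties})


varieties = ["A", "B", "C", "D", "E", "F"]
pairs = [("A", "B"), ("B", "C"), ("D", "E")]  # hypothetical judgements
print(count_groups(varieties, pairs))  # 3
```

Here A and C are grouped together even though they are not directly intelligible to each other, because B links them. A rule that does not merge along such chains would give a different count for a dialect continuum.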
Parts of work
Hammarström, H. (2005). A New Algorithm for Unsupervised Induction of Concatenative Morphology. In Yli-Jyrä, A., Karttunen, L., and Karhumäki, J., editors, Finite State Methods in Natural Language Processing: 5th International Workshop, FSMNLP 2005, Helsinki, Finland, September 1-2, 2005. Revised Papers, volume 4002 of Lecture Notes in Computer Science, pages 288–289. Springer-Verlag, Berlin.
Hammarström, H. (2006a). A naive theory of morphology and an algorithm for extraction. In Wicentowski, R. and Kondrak, G., editors, SIGPHON 2006: Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology, 8 June 2006, New York City, USA, pages 79–88. Association for Computational Linguistics.
Hammarström, H. (2006b). Poor man’s stemming: Unsupervised recognition of same-stem words. In Ng, H. T., Leong, M.-K., Kan, M.-Y., and Ji, D., editors, Information Retrieval Technology: Proceedings of the Third Asia Information Retrieval Symposium, AIRS 2006, Singapore, October 2006, volume 4182 of Lecture Notes in Computer Science, pages 323–337. Springer-Verlag, Berlin.
Hammarström, H. (2007a). A fine-grained model for language identification. In Proceedings of iNEWS-07 Workshop at SIGIR 2007, 23-27 July 2007, Amsterdam, pages 14–20. ACM.
Hammarström, H. (2007b). A survey and classification of methods for (mostly) unsupervised learning of morphology. In NODALIDA 2007, the 16th Nordic Conference of Computational Linguistics, Tartu, Estonia, 25-26 May 2007. NEALT.
Hammarström, H., Thornell, C., Petzell, M., and Westerlund, T. (2008). Bootstrapping language description: The case of Mpiemo (Bantu A, Central African Republic). In Proceedings of LREC-2008, pages 3350–3354. European Language Resources Association (ELRA).
Hammarström, H. (2009a). Poor man’s word-segmentation: Unsupervised morphological analysis for Indonesian. In Proceedings of the Third International Workshop on Malay and Indonesian Language Engineering (MALINDO). Singapore: ACL.
Hammarström, H. (2009b). A Survey of Computational Morphological Resources for Low-Density Languages. Submitted.
Forsberg, M., Hammarström, H., and Ranta, A. (2006). Lexicon extraction from raw text data. In Salakoski, T., Ginter, F., Pyysalo, S., and Pahikkala, T., editors, Advances in Natural Language Processing: Proceedings of the 5th International Conference, FinTAL 2006, Turku, Finland, August 23-25, 2006, volume 4139 of Lecture Notes in Computer Science, pages 488–499. Springer-Verlag, Berlin.
Hammarström, H. (2008a). Automatic annotation of bibliographical references with target language. In Proceedings of MMIES-2: Workshop on Multi-source, Multilingual Information Extraction and Summarization, pages 57–64. ACL.
Hammarström, H. (2008b). Counting languages in dialect continua using the criterion of mutual intelligibility. Journal of Quantitative Linguistics, 15(1):34–45.
Hammarström, H. (2009c). Whence the Kanum base-6 numeral system? Linguistic Typology, 13(2):305–319.
Hammarström, H. (2009d [to appear]). Rarities in numeral systems. In Wohlgemuth, J. and Cysouw, M., editors, Rara & Rarissima: Collecting and interpreting unusual characteristics of human languages, Empirical Approaches to Language Typology, pages 7–55. Mouton de Gruyter.
Hammarström, H. (2009e). The Status of the Least Documented Language Families in the World. Submitted.
Degree
Doctor of Engineering
University
Göteborgs universitet. IT-fakulteten
Institution
Department of Computer Science and Engineering ; Institutionen för data- och informationsteknik
Disputation
10:15 in room HB1, Hörsalsvägen 8
Date of defence
2009-12-11
harald@bombo.se
Date
2009-11-16
Author
Hammarström, Harald
Keywords
Computational Linguistics
Language typology
Publication type
Doctoral thesis
ISBN
978-91-628-7942-6
Language
eng