Unsupervised Learning of Morphology and the Languages of the World

Hammarström, Harald

dc.contributor.author	Hammarström, Harald
dc.date.accessioned	2009-11-16T10:20:28Z
dc.date.available	2009-11-16T10:20:28Z
dc.date.issued	2009-11-16T10:20:28Z
dc.identifier.isbn	978-91-628-7942-6
dc.identifier.uri	http://hdl.handle.net/2077/21418
dc.description.abstract	This thesis presents work in two areas: Language Technology and Linguistic Typology. In the field of Language Technology, a specific problem is addressed: Can a computer extract a description of word conjugation in a natural language using only written text in the language? The problem is often referred to as Unsupervised Learning of Morphology and has a variety of applications, including Machine Translation, Document Categorization and Information Retrieval. The problem is also relevant for linguistic theory. We give a comprehensive survey of work done so far on the problem and then describe a new approach to the problem as well as a number of applications. The idea is that concatenative affixation, i.e., how stems and affixes are strung together to form words, can, with some success, be modelled simplistically. Essentially, words consist of high-frequency strings (“affixes”) attached to low-frequency strings (“stems”), e.g., as in the English play-ing. Case studies show how this naive model can be used for stemming, language identification and bootstrapping language description. There are around 7 000 languages in the world, exhibiting a bewildering structural diversity. Linguistic Typology is the subfield of linguistics that aims to understand this diversity. Many of the languages in the world today are spoken only by relatively small groups of people and are threatened by extinction, and it is therefore a priority to record them. Language documentation is, and has been, an extremely decentralised activity, carried out not only by linguists, but also by missionaries, travellers, anthropologists, etc., chiefly throughout the past 200 years. There is no central record of which and how many languages have been described.
To meet the priority, we have attempted to list those languages which are the most poorly described and which do not belong to a language family where some other language is decently described – a task requiring both analysis and diligence. Next, the thesis includes typological work on one of the more tractable aspects of language structure, namely numeral systems, i.e., normed expressions used to denote exact quantities. In one of the first surveys to cover the whole world, we look at rare number bases among numeral systems. One major rarity is base-6-36 systems, which are only attested in South/Southwest New Guinea, and we make a special inquiry into their emergence. Traditionally, linguists have had headaches over what counts as a language as opposed to a dialect, and have therefore been reluctant to give counts of the number of languages in a given area. One chapter of the present thesis shows that, contrary to popular belief, there is an intuitively sound way to count languages (as opposed to dialects). The only requirement is that, for each pair of varieties, we are told whether they are mutually intelligible or not.	en
dc.language.iso	eng	en
dc.relation.haspart	Hammarström, H. (2005). A New Algorithm for Unsupervised Induction of Concatenative Morphology. In Yli-Jyrä, A., Karttunen, L., and Karhumäki, J., editors, Finite State Methods in Natural Language Processing: 5th International Workshop, FSMNLP 2005, Helsinki, Finland, September 1-2, 2005. Revised Papers, volume 4002 of Lecture Notes in Computer Science, pages 288–289. Springer-Verlag, Berlin.	en
dc.relation.haspart	Hammarström, H. (2006a). A naive theory of morphology and an algorithm for extraction. In Wicentowski, R. and Kondrak, G., editors, SIGPHON 2006: Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology, 8 June 2006, New York City, USA, pages 79–88. Association for Computational Linguistics.	en
dc.relation.haspart	Hammarström, H. (2006b). Poor man’s stemming: Unsupervised recognition of same-stem words. In Ng, H. T., Leong, M.-K., Kan, M.-Y., and Ji, D., editors, Information Retrieval Technology: Proceedings of the Third Asia Information Retrieval Symposium, AIRS 2006, Singapore, October 2006, volume 4182 of Lecture Notes in Computer Science, pages 323–337. Springer-Verlag, Berlin.	en
dc.relation.haspart	Hammarström, H. (2007a). A fine-grained model for language identification. In Proceedings of iNEWS-07 Workshop at SIGIR 2007, 23-27 July 2007, Amsterdam, pages 14–20. ACM.	en
dc.relation.haspart	Hammarström, H. (2007b). A survey and classification of methods for (mostly) unsupervised learning of morphology. In NODALIDA 2007, the 16th Nordic Conference of Computational Linguistics, Tartu, Estonia, 25-26 May 2007. NEALT.	en
dc.relation.haspart	Hammarström, H., Thornell, C., Petzell, M., and Westerlund, T. (2008). Bootstrapping language description: The case of Mpiemo (Bantu A, Central African Republic). In Proceedings of LREC-2008, pages 3350–3354. European Language Resources Association (ELRA).	en
dc.relation.haspart	Hammarström, H. (2009a). Poor man’s word-segmentation: Unsupervised morphological analysis for Indonesian. In Proceedings of the Third International Workshop on Malay and Indonesian Language Engineering (MALINDO). Singapore: ACL.	en
dc.relation.haspart	Hammarström, H. (2009b). A Survey of Computational Morphological Resources for Low-Density Languages. Submitted.	en
dc.relation.haspart	Forsberg, M., Hammarström, H., and Ranta, A. (2006). Lexicon extraction from raw text data. In Salakoski, T., Ginter, F., Pyysalo, S., and Pahikkala, T., editors, Advances in Natural Language Processing: Proceedings of the 5th International Conference, FinTAL 2006 Turku, Finland, August 23-25, 2006, volume 4139 of Lecture Notes in Computer Science, pages 488–499. Springer-Verlag, Berlin.	en
dc.relation.haspart	Hammarström, H. (2008a). Automatic annotation of bibliographical references with target language. In Proceedings of MMIES-2: Workshop on Multi-source, Multilingual Information Extraction and Summarization, pages 57–64. ACL.	en
dc.relation.haspart	Hammarström, H. (2008b). Counting languages in dialect continua using the criterion of mutual intelligibility. Journal of Quantitative Linguistics, 15(1):34–45.	en
dc.relation.haspart	Hammarström, H. (2009c). Whence the Kanum base-6 numeral system? Linguistic Typology, 13(2):305–319.	en
dc.relation.haspart	Hammarström, H. (2009d [to appear]). Rarities in numeral systems. In Wohlgemuth, J. and Cysouw, M., editors, Rara & Rarissima: Collecting and interpreting unusual characteristics of human languages, Empirical Approaches to Language Typology, pages 7–55. Mouton de Gruyter.	en
dc.relation.haspart	Hammarström, H. (2009e). The Status of the Least Documented Language Families in the World. Submitted.	en
dc.subject	Computational Linguistics	en
dc.subject	Language typology	en
dc.title	Unsupervised Learning of Morphology and the Languages of the World	en
dc.type	Text
dc.type.svep	Doctoral thesis
dc.gup.mail	harald@bombo.se	en
dc.type.degree	Doctor of Engineering	en
dc.gup.origin	Göteborgs universitet. IT-fakulteten	en
dc.gup.department	Department of Computer Science and Engineering ; Institutionen för data- och informationsteknik	en
dc.citation.doi	ITF
dc.gup.defenceplace	10:15 in room HB1, Hörsalsvägen 8	en
dc.gup.defencedate	2009-12-11
dc.gup.dissdb-fakultet	ITF

Files in this item

Name:: gupea_2077_21418_1.pdf
Size:: 2.350Mb
Format:: PDF
Description:: PhD, Fulltext

Name:: gupea_2077_21418_2.pdf
Size:: 97.21Kb
Format:: PDF
Description:: Abstract

Name:: gupea_2077_21418_16.pdf
Size:: 76.84Mb
Format:: PDF
Description:: Thesis

This item appears in the following Collection(s)

Doctoral Theses / Doktorsavhandlingar Institutionen för data- och informationsteknik
Doctoral Theses from University of Gothenburg / Doktorsavhandlingar från Göteborgs universitet
