Computational linguistics resources for Indo-Iranian languages

Virk, Shafqat

Abstract

Can computers process human languages? During the last fifty years, two main approaches have been used to find an answer to this question: data-driven (i.e. statistics-based) and knowledge-driven (i.e. grammar-based). The former relies on the availability of a vast amount of electronic linguistic data and the processing capabilities of modern-age computers, while the latter builds on grammatical rules and classical linguistic theories of language. In this thesis, we mainly use the second approach and describe the development of computational ("resource") grammars for six Indo-Iranian languages: Urdu, Hindi, Punjabi, Persian, Sindhi, and Nepali. We explore different lexical and syntactic aspects of these languages and build their resource grammars using the Grammatical Framework (GF), a type-theoretical grammar formalism. We also provide computational evidence of the similarities and differences between Hindi and Urdu, and report a mechanical development of a Hindi resource grammar starting from an Urdu resource grammar. We use a functor-style implementation that makes it possible to share the commonalities between the two languages. Our analysis shows that up to 94% of the syntax can be shared, whereas at the lexical level Hindi and Urdu differ in 18% of the basic words, in 31% of tourist phrases, and in 92% of school mathematics terms. Next, we describe the development of wide-coverage morphological lexicons for some of the Indo-Iranian languages. We use existing linguistic data from different resources (i.e. dictionaries and WordNets) to build uni-sense and multi-sense lexicons. Finally, we demonstrate how we used the reported grammatical and lexical resources to add support for Indo-Iranian languages in a few existing GF application grammars. These include the Phrasebook, the mathematics grammar library, and the Attempto controlled English grammar. Further, we give the experimental results of developing a wide-coverage, grammar-based translator for arbitrary text using these resources. These applications show the importance of such linguistic resources and open new doors for future research on these languages.

Degree

Doctor of Philosophy

University

University of Gothenburg. IT Faculty

Institution

Department of Computer Science and Engineering

Disputation

Monday, 3 June 2013, 10:00, HC2, Chalmers University of Technology

Date of defence

2013-06-03

Date

2014-08-19

Author

Virk, Shafqat

Keywords

Grammatical Framework

Indo-Iranian Languages

Resource Grammars

Publication type

doctoral thesis

ISBN

9789162887063

Series/Report no.

Technical report. D (Department of Computer Science and Engineering, Chalmers University of Technology & University of Gothenburg)

Language

eng
