Natural Language Processing for Low-resourced Code-switched Colloquial Languages – The Case of Algerian Language

Adouane, Wafia

dc.contributor.author	Adouane, Wafia
dc.date.accessioned	2020-06-09T10:36:12Z
dc.date.available	2020-06-09T10:36:12Z
dc.date.issued	2020-06-09
dc.identifier.isbn	978-91-7833-958-7 (print)
dc.identifier.isbn	978-91-7833-959-4 (pdf)
dc.identifier.uri	http://hdl.handle.net/2077/64548
dc.description.abstract	In this thesis we explore to what extent deep neural networks (DNNs), trained end-to-end, can be used to perform natural language processing tasks for code-switched colloquial languages lacking both large automated data and processing tools, for instance tokenisers, morpho-syntactic and semantic parsers, etc. We opt for an end-to-end learning approach because this kind of data is hard to control due to its high orthographic and linguistic variability. This variability makes it unrealistic to either find a dataset that exhaustively covers all the possible cases that could be used to devise processing tools or to build equivalent rule-based tools from the bottom up. Moreover, all our models are language-independent and do not require access to additional resources, hence we hope that they will be used with other languages or language varieties with similar settings. We deal with the case of user-generated textual data written in Algerian language as naturally produced in social media. We experiment with five natural language processing tasks, namely Code-switch Detection, Semantic Textual Similarity, Spelling Normalisation and Correction, Sentiment Analysis, and Named Entity Recognition. For each task, we created a dataset from user-generated data reflecting the real use of the language. Our experimental results in various setups indicate that end-to-end DNNs combined with character-level representation of the data are promising. Further experiments with advanced models, such as Transformer-based models, could lead to even better results. Completely solving the challenge of code-switched colloquial languages is beyond the scope of this experimental work. Even so, we believe that this work will extend the utility of DNNs trained end-to-end to low-resource settings. Furthermore, the results of our experiments can be used as a baseline for future research.	sv
dc.language.iso	eng	sv
dc.relation.haspart	Wafia Adouane and Simon Dobnik. 2017. “Identification of Languages in Algerian Arabic Multilingual Documents”. In Proceedings of The 3rd Arabic Natural Language Processing Workshop (WANLP), pages 1–8. Association for Computational Linguistics. ::doi:: https://www.aclweb.org/anthology/W17-1301/	sv
dc.relation.haspart	Wafia Adouane, Simon Dobnik, Jean-Philippe Bernardy, and Nasredine Semmar. 2018. “A Comparison of Character Neural Language Model and Bootstrapping for Language Identification in Multilingual Noisy Texts”. In Proceedings of the 2nd Workshop on Subword and Character Level Models in NLP (SCLeM), pages 22–31. Association for Computational Linguistics. ::doi:: https://www.aclweb.org/anthology/W18-1203/	sv
dc.relation.haspart	Wafia Adouane, Jean-Philippe Bernardy, and Simon Dobnik. 2018. “Improving Neural Network Performance by Injecting Background Knowledge: Detecting Code-switching and Borrowing in Algerian texts”. In Proceedings of the 3rd Workshop on Computational Approaches to Linguistic Code-Switching, pages 20–28. Association for Computational Linguistics. ::doi:: https://www.aclweb.org/anthology/W18-3203/	sv
dc.relation.haspart	Wafia Adouane, Jean-Philippe Bernardy, and Simon Dobnik. 2019. “Neural Models for Detecting Binary Semantic Textual Similarity for Algerian and MSA”. In Proceedings of the 4th Arabic Natural Language Processing Workshop (WANLP), pages 78–87. Association for Computational Linguistics. ::doi:: https://www.aclweb.org/anthology/W19-4609/	sv
dc.relation.haspart	Wafia Adouane, Jean-Philippe Bernardy, and Simon Dobnik. 2019. “Normalising Non-standardised Orthography in Algerian Code-switched User-generated Data”. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT), pages 131–140. Association for Computational Linguistics. ::doi:: https://www.aclweb.org/anthology/D19-5518/	sv
dc.relation.haspart	Wafia Adouane, Samia Touileb, and Jean-Philippe Bernardy. 2020. “Identifying Sentiments in Algerian Code-switched User-generated Comments”. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pages 2691–2698. European Language Resources Association. ::doi:: https://www.aclweb.org/anthology/2020.lrec-1.328/	sv
dc.relation.haspart	Wafia Adouane and Jean-Philippe Bernardy. 2020. “When is Multi-task Learning Beneficial for Low-Resource Noisy User-generated Algerian Texts?” In Proceedings of the 4th Workshop on Computational Approaches to Linguistic Code-Switching, pages 17–25. European Language Resources Association. ::doi:: https://www.aclweb.org/anthology/2020.calcs-1.3/	sv
dc.subject	Natural language processing	sv
dc.subject	Deep neural networks	sv
dc.subject	Low-resourced language	sv
dc.subject	Colloquial language	sv
dc.subject	Code-switch	sv
dc.subject	Dialectal Arabic	sv
dc.subject	User-generated data	sv
dc.subject	Non-standardised orthography	sv
dc.subject	Algerian language	sv
dc.title	Natural Language Processing for Low-resourced Code-switched Colloquial Languages – The Case of Algerian Language	sv
dc.type	Text
dc.type.svep	Doctoral thesis	eng
dc.gup.mail	wafia.adouane@gu.se	sv
dc.gup.mail	wafia.gu@gmail.com	sv
dc.type.degree	Doctor of Philosophy	sv
dc.gup.origin	Göteborgs universitet. Humanistiska fakulteten	swe
dc.gup.origin	University of Gothenburg. Faculty of Humanities	eng
dc.gup.department	Department of Philosophy, Linguistics and Theory of Science ; Institutionen för filosofi, lingvistik och vetenskapsteori	sv
dc.gup.defenceplace	September 2, 2020 at 17:00 in C350, Humanisten, Renströmsgatan 6, Gothenburg https://gu-se.zoom.us/j/64726382903?pwd=Vk9GTFd6VENiZXhFcTFJUkpBTzVwdz09	sv
dc.gup.defencedate	2020-09-02
dc.gup.dissdb-fakultet	HF

Files in this item

Name:: gupea_2077_64548_1.pdf
Size:: 1.658Mb
Format:: PDF
Description:: Thesis frame

Name:: gupea_2077_64548_2.pdf
Size:: 39.00Kb
Format:: PDF
Description:: Nailing sheet

Name:: gupea_2077_64548_3.pdf
Size:: 3.286Mb
Format:: PDF
Description:: Cover

This item appears in the following Collection(s)

Doctoral Theses / Doktorsavhandlingar Institutionen för filosofi, lingvistik och vetenskapsteori
Doctoral Theses from University of Gothenburg / Doktorsavhandlingar från Göteborgs universitet

Related items

Why the pond is not outside the frog? Grounding in contextual representations by neural language models

Proceedings of the 2022 CLASP Conference on (Dis)embodiment

Steg för steg. Naturvetenskapligt ämnesspråk som räknas
