Resources and Applications for Dialectal Arabic: the Case of Levantine

Qwaider, Chatrine; Abu Kwaik, Kathrein

dc.contributor.author	Qwaider, Chatrine
dc.contributor.author	Abu Kwaik, Kathrein
dc.date.accessioned	2022-05-03T11:03:44Z
dc.date.available	2022-05-03T11:03:44Z
dc.date.issued	2022-05-03
dc.identifier.isbn	978-91-8009-803-8 (print) 978-91-8009-804-5 (pdf)
dc.identifier.uri	https://hdl.handle.net/2077/71096
dc.description.abstract	This thesis is a computational study of Dialectal Arabic (DA). It focuses on Levantine Arabic and develops tools and resources for Dialectal Arabic Natural Language Processing (DANLP). It investigates the creation of fine-grained resources that can be used for a variety of computational tasks, and a number of effective models that can deal with the complexity of fine-grained dialectal data. Dialect Identification (DI) and Sentiment Analysis (SA) are the Natural Language Processing (NLP) tasks investigated in this thesis. In the first part (Study 1 and Study 2), I study the DI task at both coarse-grained and fine-grained levels. For this purpose, I build the first annotated Levantine (SHAMI) Dialect Corpus (SDC). Furthermore, I explore the ability of n-gram language models, Machine Learning (ML) algorithms and ensemble learning techniques to classify and detect 26 Arabic varieties. In the second part, I conduct a linguistic study to measure the lexical distance between Modern Standard Arabic (MSA) and DA, and between the dialects themselves. This is done to check whether transferring knowledge from one variety to another is possible. In the third part (Studies 4, 5 and 6), I explore Arabic Sentiment Analysis. I investigate the idea of knowledge transfer between MSA and the dialects using SA as a case study. Furthermore, I implement various models, including the pre-trained language model BERT, Deep Learning (DL), ML and feature engineering approaches, to detect the sentiment polarity of DA data. I introduce two resources for this task, one focusing on Levantine sentiment (Shami-Senti) and the other covering DA in general (ATSAD). I explore different annotation methods, e.g. human, lexicon-based and automatic distant-supervision annotation. The last study is about choosing the best model for DI and SA. For this, I apply well-known models and approaches to various kinds of DA resources. The thesis contributes to the field of DANLP in a number of ways. The introduced resources can be seen as a stepping stone for a deeper investigation and understanding of issues in DANLP. They are also reliable and can be used by researchers to address different NLP tasks. The cross-dialectal linguistic studies will open up prospects for researchers to fine-tune models and transfer knowledge among Arabic varieties. A big part of the contribution lies in designing DI and SA models. I implement several ML models that use feature engineering approaches and n-gram language models to identify the dialect or detect the sentiment. For DI, I design and implement an ensemble learning model that is able to handle fine-grained dialects. Additionally, I apply DL models to different dialectal SA datasets and achieve competitive results. For both tasks, I use recent pre-trained language models and perform a comparison to choose the best model. I also implement a semi-supervised approach for automatic labelling and annotating data with the help of self-training techniques to improve the performance of the dataset. These models will help researchers dive deeper into DANLP and create practical and industrial systems.	en_US
dc.language.iso	eng	en_US
dc.relation.haspart	Kathrein Abu Kwaik, Motaz Saad, Stergios Chatzikyriakidis and Simon Dobnik. "Shami: A corpus of Levantine Arabic dialects." In proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). 2018 https://aclanthology.org/L18-1576.pdf	en_US
dc.relation.haspart	Kathrein Abu Kwaik and Motaz K Saad. "ArbDialectID at MADAR Shared Task 1: Language Modelling and Ensemble Learning for Fine Grained Arabic Dialect Identification." In proceedings of the Fourth Arabic Natural Language Processing Workshop (2019) https://aclanthology.org/W19-4632.pdf	en_US
dc.relation.haspart	Kwaik, Kathrein Abu, Motaz Saad, Stergios Chatzikyriakidis and Simon Dobnik. "A Lexical Distance Study of Arabic Dialects." Procedia computer science 142, (2018): pp. 2-13. https://reader.elsevier.com/reader/sd/pii/S1877050918321562?token=6C8E8526DC9631AFE8D86D991307A11538589B3E14E325E6618895F005DBAE5FFE06C696F554CBBAA679E25B216AA28D&originRegion=eu-west-1&originCreation=20220422083106	en_US
dc.relation.haspart	Chatrine Qwaider, Stergios Chatzikyriakidis and Simon Dobnik. "Can Modern Standard Arabic Approaches be used for Arabic Dialects? Sentiment Analysis as a Case Study." In proceedings of the 3rd Workshop on Arabic Corpus Linguistics, pp. 40-50. 2019. https://aclanthology.org/W19-5606.pdf	en_US
dc.relation.haspart	Kathrein Abu Kwaik, Motaz Saad, Stergios Chatzikyriakidis and Simon Dobnik. "LSTM-CNN Deep Learning Model for Sentiment Analysis of Dialectal Arabic." In proceedings of the International Conference on Arabic Language Processing, pp. 108-121. Springer, Cham, 2019. https://www.stergioschatzikyriakidis.com/uploads/1/0/3/6/10363759/icalp_deep_learning.pdf	en_US
dc.relation.haspart	Kathrein Abu Kwaik, Stergios Chatzikyriakidis, Simon Dobnik, Motaz Saad and Richard Johansson. "An Arabic Tweets Sentiment Analysis Dataset (ATSAD) using Distant Supervision and Self Training." In proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pp. 1-8. 2020. https://aclanthology.org/2020.osact-1.1.pdf	en_US
dc.relation.haspart	Kathrein Abu Kwaik, Stergios Chatzikyriakidis and Simon Dobnik. "Pre-trained models or feature engineering? The case of Arabic Dialectal Identification and Sentiment Analysis"	en_US
dc.subject	Dialectal Arabic Natural Language Processing	en_US
dc.subject	Computational Linguistics	en_US
dc.subject	Dialect Identification	en_US
dc.subject	Sentiment Analysis	en_US
dc.subject	Machine Learning	en_US
dc.subject	Deep Learning	en_US
dc.subject	Language modelling	en_US
dc.subject	Natural Language processing	en_US
dc.title	Resources and Applications for Dialectal Arabic: the Case of Levantine	en_US
dc.type	Text
dc.type.svep	Doctoral thesis	eng
dc.gup.mail	chatrine.qwaider@chalmers.se	en_US
dc.gup.mail	kathrein.abu.kwaik@gu.se	en_US
dc.type.degree	Doctor of Philosophy	en_US
dc.gup.origin	Göteborgs universitet. Humanistiska fakulteten	swe
dc.gup.origin	University of Gothenburg. Faculty of Humanities	eng
dc.gup.department	Department of Philosophy, Linguistics and Theory of Science ; Institutionen för filosofi, lingvistik och vetenskapsteori	en_US
dc.gup.defenceplace	Wednesday, May 25, 2022, at 15:00, J439, Lilla Hörsalen, Humanisten, Renströmsgatan 6, Gothenburg.	en_US
dc.gup.defencedate	2022-05-25
dc.gup.dissdb-fakultet	HF

Files in this item

Name: 170462 Chatrine Qwaider spikbl ...
Size: 533.5Kb
Format: PDF
Description: Notification sheet (spikblad)

Name: 170462 Chatrine Qwaider_espik ...
Size: 13.43Mb
Format: PDF
Description: Thesis frame

Name: 170462 Chatrine Qwaider omslag ...
Size: 4.829Mb
Format: PDF
Description: Cover

This item appears in the following collection(s)

Doctoral Theses / Doktorsavhandlingar Institutionen för filosofi, lingvistik och vetenskapsteori
