
Browsing by Author "Khojah, Ranim"

Now showing 1 - 2 of 2
Evaluating Confidence Estimation in NLU for Dialogue Systems
(2022-06-20) Khojah, Ranim; University of Gothenburg / Department of Philosophy, Linguistics and Theory of Science; Göteborgs universitet / Institutionen för filosofi, lingvistik och vetenskapsteori
Background: Natural Language Understanding (NLU) is an important component of Dialogue Systems (DS) that makes human utterances understandable to machines. A central aspect of NLU is intent classification: the NLU receives a user utterance and outputs a list of N ranked hypotheses (an N-best list) of the predicted intent, with a confidence estimate (a real number between 0 and 1) assigned to each hypothesis.

Objectives: In this study, we perform an in-depth evaluation of the confidence estimation of five NLUs: Watson Assistant, Language Understanding Intelligent Service (LUIS), Snips.ai, and Rasa in two different configurations (Sklearn and DIET). We measure calibration on two levels: rank level (results for specific ranks) and model level (aggregated results across ranks), as well as performance on the model level. Calibration here refers to the relation between confidence estimates and true likelihood, i.e. how useful the confidence estimate associated with a hypothesis is for assessing its likelihood of being correct.

Methodology: We conduct an exploratory case study on the NLUs. We train the NLUs on intent classification tasks using a subset of a multi-domain dataset proposed by Liu et al. (2021). We assess the calibration of the NLUs on the model and rank levels using reliability diagrams and the correlation coefficient with respect to instance-level accuracy, and we measure performance through accuracy and F1-score.

Results: The evaluation shows that on the model level, the best-calibrated NLU is Rasa-Sklearn and the least calibrated is Snips, while Watson is the best-performing NLU and Rasa-Sklearn the worst. The rank-level results resonate with the model-level results. However, on lower ranks, some measures become less informative due to low variation in the confidence estimates.

Conclusion: Our findings convey that when choosing an NLU for a dialogue system, there is a trade-off between calibration and performance: a well-performing NLU is not necessarily well calibrated, and vice versa. While the chosen calibration metrics are clearly useful, we also note some limitations and conclude that further investigation is needed to find an optimal metric of calibration. It should also be noted that, to some extent, our results rest on the assumption that the chosen calibration metrics are suitable for our purposes.
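The calibration analysis described in the abstract (reliability diagrams relating confidence estimates to observed accuracy) can be sketched in a few lines. The snippet below is a minimal illustration, not the thesis's actual evaluation code: the binning scheme, the function names, and the toy confidence/correctness data are all invented for this example.

```python
# Minimal sketch of model-level calibration checking: bin the rank-1
# confidence estimates, compare per-bin accuracy to per-bin mean
# confidence (the data behind a reliability diagram), and summarize
# the gap as expected calibration error (ECE). All data is invented.

def reliability_bins(confidences, correct, n_bins=5):
    """Group (confidence, correct) pairs into equal-width bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        bins[idx].append((c, ok))
    return bins

def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted mean of |accuracy - mean confidence| across the bins."""
    n = len(confidences)
    ece = 0.0
    for b in reliability_bins(confidences, correct, n_bins):
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - mean_conf)
    return ece

# Toy predictions: confidence of the top-ranked intent hypothesis and
# whether that predicted intent was actually correct (1) or not (0).
conf = [0.95, 0.90, 0.85, 0.60, 0.55, 0.30, 0.92, 0.45]
hit = [1, 1, 0, 1, 0, 0, 1, 0]
print(round(expected_calibration_error(conf, hit), 3))  # → 0.29
```

A perfectly calibrated model would have per-bin accuracy equal to per-bin mean confidence, giving an ECE of zero; larger values indicate over- or under-confidence.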
    Evaluating the Trade-offs of Diversity-Based Test Prioritization: An Experiment
    (2020-12-03) Khojah, Ranim; Hong Chao, Chi; Göteborgs universitet/Institutionen för data- och informationsteknik; University of Gothenburg/Department of Computer Science and Engineering
Test prioritization techniques aim to detect faults at earlier stages of test execution. To this end, diversity-based techniques (DBT) have proven cost-effective: they prioritize the most dissimilar test cases to maintain effectiveness and coverage with fewer resources at different stages of the software development life cycle, called levels of testing (LoT). Diversity is measured on static test specifications to convey how different test cases are from one another. However, there is little research on DBT applied to the semantic similarity of words within tests. Moreover, diversity has been studied extensively within individual LoT (unit, integration, and system), but the trade-offs of such techniques across levels are not well understood.

Objective and Methodology: This paper aims to reveal relationships between DBT and the LoT, and to compare and evaluate the cost-effectiveness and coverage of different diversity measures, namely Jaccard's Index, Levenshtein, Normalized Compression Distance (NCD), and Semantic Similarity (SS). We perform an experiment on the test suites of 7 open-source projects at the unit level, 1 industrial project at the integration level, and 4 industrial projects at the system level (one project is used at both the system and integration levels).

Results: Our results show that SS increases test coverage for system-level tests, and the differences in the failure detection rate of the diversity measures grow as more prioritized tests execute. In terms of execution time, Jaccard is the fastest, whereas Levenshtein is the slowest and, in some cases, simply infeasible to run. In contrast, Levenshtein detects more failures at the integration level, and Jaccard more at the system level.

Conclusion: Future work can implement SS on code artefacts and include other DBT in the comparison. Suspected test suite properties that seem to affect DBT performance can be investigated in greater detail.
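As a rough illustration of how diversity-based prioritization over static test specifications might work, the sketch below greedily orders test cases by Jaccard distance over tokenized specifications. The test names, the toy specifications, and the greedy max-min selection strategy are assumptions for illustration only, not the paper's actual implementation.

```python
# Hypothetical sketch of diversity-based test prioritization using
# Jaccard distance on tokenized test specifications: repeatedly pick
# the test case that is most dissimilar to those already selected.
# All test names and specifications below are invented.

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the token sets of two specifications."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def prioritize(tests):
    """Greedy max-min diversity ordering of a {name: tokens} mapping."""
    remaining = dict(tests)
    # Seed the ordering with an arbitrary (here: first) test case.
    order = [next(iter(remaining))]
    remaining.pop(order[0])
    while remaining:
        # Choose the test whose minimum distance to the already
        # selected tests is largest, i.e. the most dissimilar next pick.
        name = max(remaining, key=lambda t: min(
            jaccard_distance(remaining[t], tests[s]) for s in order))
        order.append(name)
        remaining.pop(name)
    return order

specs = {
    "t_login": "open login page enter credentials submit".split(),
    "t_login2": "open login page enter wrong credentials submit".split(),
    "t_search": "open search page enter query press enter".split(),
}
print(prioritize(specs))  # → ['t_login', 't_search', 't_login2']
```

The intuition matches the abstract: executing dissimilar tests first tends to cover more distinct behavior early, which is why the near-duplicate login test is deferred to the end here.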

DSpace software copyright © 2002-2025 LYRASIS
