Creating Synthetic Dialogue Datasets for NLU Training. An Approach Using Large Language Models

Author: Laszlo, Bogdan
Date: 2024-06-20
URI: https://hdl.handle.net/2077/81885
Language: eng
Subject: Language Technology
Type: Text

Abstract: This thesis explores using the GPT-4 large language model to generate high-quality, diverse synthetic dialogue datasets for training Natural Language Understanding (NLU) models in task-oriented dialogue systems. Using a schema-guided framework and prompt engineering, the study examines whether synthetic data can replace real-world data. The research focuses on three NLU tasks: domain classification, active intent classification, and slot multi-labelling. Results show that while synthetic datasets can moderately match real-world data, issues with data quality and annotation inconsistency persist.