Creating Synthetic Dialogue Datasets for NLU Training. An Approach Using Large Language Models
Date
2024-06-20
Authors
Abstract
This thesis investigates the use of the GPT-4 large language model to generate high-quality,
diverse synthetic dialogue datasets for training Natural Language Understanding (NLU) models
in task-oriented dialogue systems.
Using a schema-guided framework and prompt engineering, the study examines
whether synthetic data can replace real-world data. The research focuses on three NLU tasks:
domain classification, active intent classification, and slot multi-labelling. Results show
that while synthetic datasets can moderately match real-world data, problems with data quality
and annotation inconsistency persist.
Keywords
Language Technology