Creating Synthetic Dialogue Datasets for NLU Training. An Approach Using Large Language Models

Date

2024-06-20

Abstract

This thesis explores using the GPT-4 large language model to generate high-quality, diverse synthetic dialogue datasets for training Natural Language Understanding (NLU) models in task-oriented dialogue systems. Using a schema-guided framework and prompt engineering, the study examines whether synthetic data can replace real-world data. The research focuses on three NLU tasks: domain classification, active intent classification, and slot multi-labelling. Results show that while synthetic datasets can moderately match real-world data, issues such as quality and annotation inconsistency persist.

Keywords

Language Technology
