Interpretable Methods for Information Removal in Text-Based Learning: Exploring the Use of Large Language Models and SHAP as Information Removal Tools in Text-Based Learning

Date

2025-10-08

Abstract

In text analysis, it is sometimes necessary to remove or obscure information that should not influence downstream processing or interpretation. This thesis addresses the challenge of removing such information from text while preserving as much of the remaining content as possible. Most successful existing methods operate in the embedding space, which makes it difficult to see which specific parts of the text are being changed. In contrast, this thesis proposes a more interpretable approach that operates directly on the text, taking raw text as input and producing rewritten text as output. While one recent method also attempts direct text-based removal using large language models (LLMs), this work extends it by exploring a wider range of prompt strategies and, more importantly, by introducing an intermediate step based on SHAP (SHapley Additive exPlanations). SHAP is used to extract token-level importance scores from a classifier trained to predict the forbidden variable, giving the language model targeted guidance on which parts of the text are most relevant to remove. The proposed method was evaluated on two datasets: one of professional biographies and one of Amazon product reviews. The results indicate that the method successfully removes the forbidden variable from the first dataset while preserving the remaining content, but fails to remove it from the second. Across both datasets, the most effective setups included the SHAP-based guidance step, suggesting that SHAP improves the performance of the information removal method. These findings highlight that LLM-based text disentanglement is not a one-size-fits-all solution; it requires strategies adapted to the context and the nature of the sensitive information.
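
To make the SHAP-based guidance step concrete, the sketch below shows one way it could be implemented in Python. It is a minimal illustration, not the thesis's actual pipeline: the classifier checkpoint path, the choice of gender as the forbidden variable, the number of highlighted tokens, and the prompt wording are all assumptions made for this example.

    # Minimal sketch of SHAP-guided information removal (illustrative only).
    # Assumptions: a HuggingFace text classifier that predicts the forbidden
    # variable, and any instruction-tuned LLM that accepts a text prompt.
    import numpy as np
    import shap
    from transformers import pipeline

    # Classifier predicting the forbidden variable (checkpoint is a placeholder).
    clf = pipeline(
        "text-classification",
        model="path/to/forbidden-variable-classifier",  # hypothetical path
        return_all_scores=True,  # SHAP needs scores for every class
    )

    text = "She has been a software engineer for ten years ..."

    # SHAP wraps the pipeline and attributes the prediction to input tokens.
    explainer = shap.Explainer(clf)
    explanation = explainer([text])[0]

    # Rank tokens by the magnitude of their attribution. For multi-class
    # output, take the class with the largest total attribution (a heuristic).
    scores = explanation.values
    if scores.ndim > 1:
        scores = scores[:, np.abs(scores).sum(axis=0).argmax()]
    ranked = sorted(zip(explanation.data, scores), key=lambda p: -abs(p[1]))
    top_tokens = [tok for tok, _ in ranked[:10]]

    # Give the rewriting LLM targeted guidance on what to change.
    prompt = (
        "Rewrite the text so a reader cannot infer the author's gender, "
        "changing as little else as possible. These spans were most "
        f"predictive of gender: {top_tokens}.\n\nText:\n{text}"
    )
    # rewritten = some_llm(prompt)  # any instruction-tuned LLM could be used here

In this sketch, the SHAP attributions simply become part of the prompt; the thesis explores several prompt strategies around this guidance step, which are not reproduced here.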

Keywords

Natural Language Processing, Large Language Models, Interpretability, Explainability, NLP, LLM, SHAP
