An exploratory field study on the use of data management and data quality requirements in ML-enabled software applied in environmental research
Date
2025-10-07
Abstract
Integrating machine learning into environmental science has shown great promise in improving research outcomes. However, the effective application of machine learning and the reliability of its results depend heavily on data quality and data management practices, which are often overlooked or addressed inconsistently. A sound data pipeline that incorporates good practices for both is therefore important. This thesis introduces SPADES-ML (Scientific Pipeline Assessment and Data-Centric Evaluation Scorecard for Machine Learning), a structured assessment framework developed to evaluate the quality and transparency of data-related practices in machine-learning-based research. SPADES-ML is demonstrated through a case study of machine-learning-based environmental research.
A total of 28 research papers were analysed using SPADES-ML. The framework was applied to assess five critical areas: data selection and suitability, data quality, adherence to the FAIR principles, data preprocessing, and challenges in preprocessing. To validate the findings, a survey was conducted among practitioners in machine-learning-based environmental research. Both the literature analysis and the survey revealed recurring challenges in ensuring data quality, reproducibility, and methodological excellence. Furthermore, drawing on these results, this study provides initial recommendations for improving data practices in machine-learning-based research by adhering to software engineering principles. This thesis contributes to the emerging field of research software engineering by offering a structured evaluation and guidelines for robust methodology pipelines in interdisciplinary, machine-learning-based research.
Keywords
Data-Centric Evaluation, Data Management, Data Quality, Data Quality Challenges, Environmental Research, FAIR, Machine Learning, Methodological Guidelines, Software Engineering, SPADES-ML