An exploratory field study on the use of data management and data quality requirements in ML-enabled software applied in environmental research

dc.contributor.author: Mahagamarachchi, Devasinghage Sara Nirmani
dc.contributor.author: Pamali Chathurika, Hikkaduwa Liyanage
dc.contributor.department: Göteborgs universitet/Institutionen för data- och informationsteknik (swe)
dc.contributor.department: University of Gothenburg/Department of Computer Science and Engineering (eng)
dc.date.accessioned: 2025-10-07T12:16:20Z
dc.date.available: 2025-10-07T12:16:20Z
dc.date.issued: 2025-10-07
dc.description.abstract: Integrating machine learning into environmental science has shown great promise in improving research outcomes. However, the effective application of machine learning and the reliability of its results depend heavily on data quality and data management practices, which are often overlooked or addressed inconsistently. A proper data pipeline that includes good practices for data quality and data management is therefore important. This thesis introduces SPADES-ML (Scientific Pipeline Assessment and Data-Centric Evaluation Scorecard for Machine Learning), a structured assessment framework developed to evaluate the quality and transparency of data-related practices in machine learning-based research. SPADES-ML is demonstrated through a case study of machine learning-based environmental research, in which a total of 28 research papers were analysed. The framework was applied to assess five critical areas: data selection and suitability, data quality, adherence to the FAIR principles, data preprocessing, and challenges in preprocessing. A survey targeting practitioners in machine learning-based environmental research was conducted to validate the findings. Results from the literature and survey analyses revealed recurring challenges in ensuring data quality, reproducibility, and methodological excellence. Furthermore, this study provides initial recommendations for improving data practices in machine learning-based research by adhering to software engineering principles. This thesis contributes to the emerging field of research software engineering by offering a structured evaluation and guidelines for robust methodology pipelines in interdisciplinary, machine learning-based research. (sv)
dc.identifier.uri: https://hdl.handle.net/2077/89841
dc.language.iso: eng (sv)
dc.setspec.uppsok: Technology
dc.subject: Data-Centric Evaluation (sv)
dc.subject: Data Management (sv)
dc.subject: Data Quality (sv)
dc.subject: Data Quality Challenges (sv)
dc.subject: Environmental Research (sv)
dc.subject: FAIR (sv)
dc.subject: Machine Learning (sv)
dc.subject: Methodological Guidelines (sv)
dc.subject: Software engineering (sv)
dc.subject: SPADES-ML (sv)
dc.title: An exploratory field study on the use of data management and data quality requirements in ML-enabled software applied in environmental research (sv)
dc.type: text
dc.type.degree: Student essay
dc.type.uppsok: H2

Files

Original bundle

Name: CSE 25-26 SM PC.pdf
Size: 15.82 MB
Format: Adobe Portable Document Format
Description: Master thesis
