Gießelbach, Sven Alexander: Data Science with Foundation Models : An Evidence-Based, Comprehensive Project Methodology. - Bonn, 2024. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-76246
@phdthesis{handle:20.500.11811/11566,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-76246},
author = {{Sven Alexander Gießelbach}},
title = {Data Science with Foundation Models : An Evidence-Based, Comprehensive Project Methodology},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2024,
month = may,

note = {In recent decades, interest in data science has surged across sectors and industries. Data science has been shaped by a series of paradigm shifts, such as the emergence of big data, cloud computing, and the demand for developing and operating machine learning-based applications. Project methodologies have continually strived to catch up, but with limited success, since 80% of data science projects never reach deployment. While data science projects initially ended once a useful model was obtained from carefully engineered data, models are nowadays integrated into software applications, which must afterwards be maintained, improved, and adapted. This requires careful management of all kinds of digital project artifacts.
The latest paradigm shift in machine learning, which has not yet been reflected in any project methodology, is due to foundation models, which have established themselves as a de facto standard across various media including text, images, video, and audio. Among them, large language models play a central role, as they can be reused for other tasks without additional training by prompting them with natural language. In addition, a foundation model can be fine-tuned to new tasks and domains with limited data, whereas building a foundation model from scratch requires enormous amounts of data and careful provisions for alignment. Consequently, the nature of data science projects has changed.
This thesis proposes the first methodology specifically crafted for modern data science projects, placing foundation models at its core. It seamlessly integrates application development phases into the project lifecycle, adopts an encompassing view of project management, and places a particular emphasis on artifact management for cross-project business purposes. A guiding principle of our methodology is the persistent engagement of domain experts and users throughout the project lifecycle. Recognizing the limited availability of these stakeholders, we propose specific tools to enhance their efficiency, particularly in labor-intensive project activities such as data annotation, modeling, and evaluation.
The methodology fully covers a hierarchical catalog of requirements, which comprises 163 groups of project characteristics. The catalog is comprehensive and evidence-based, derived from two exhaustive literature studies. The first study provides a broad and general overview of data science projects, while the second examines the state-of-the-art literature on foundation models. The catalog is validated theoretically by matching it to 8 meta-studies on data science methodologies, and practically by case studies from 26 natural language understanding projects.
The catalog provides a framework for assessing data science methodologies. A total of 27 such methodologies are compared, and a gap analysis confirms weaknesses in each of them. The most prominent gaps are insufficient support for application development and for cross-project artifact management, which make these methodologies inadequate for modern foundation model projects. All identified gaps are covered by our proposed methodology, proving it to be comprehensive, dedicated to projects with foundation models, and conducive to cross-project reuse.
The ecosystem and frameworks around foundation models are evolving rapidly, and we expect them to soon offer tools that facilitate the project management processes in our methodology.},

url = {https://hdl.handle.net/20.500.11811/11566}
}

The following terms of use apply to this resource:

InCopyright