Huber, Florian Philipp: Yield Prediction with Explainable Machine Learning. - Bonn, 2024. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-79699
@phdthesis{handle:20.500.11811/12560,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-79699},
author = {Huber, Florian Philipp},
title = {Yield Prediction with Explainable Machine Learning},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2024,
month = nov,

note = {Starting from a federal project to predict grapevine yields in Germany, we faced five challenges in enabling machine learning for yield prediction. The first challenge is training on small data sets, as capturing data for yield prediction is very time-consuming, with most plants following an annual cycle. Providing a feature-based representation of remote sensing data by modeling the underlying distributions allows gradient boosting methods to outperform deep learning approaches by 25% in our experiments on soybean yield prediction in the US, one of the largest yield prediction datasets, which allows for international comparability. The second challenge is the need for explanations to show that the model's decision making is in line with experts' knowledge of the field. For this challenge, we extend the idea of Shapley value feature attributions to predefined groups of features. The groupings arise naturally in yield prediction scenarios and allow for an improved representation of the explanations, as individual features are plentiful and often abstract. We present a novel algorithm that calculates the grouped Shapley values in polynomial time for random forests, such as those resulting from the gradient boosting pipeline of the first challenge. Third, we work towards better feature selection for yield prediction tasks. The introduction of grouped Shapley values raises the question of whether Shapley values can be used for feature selection. To address this question, we define four necessary conditions for a Shapley value suitable for feature selection. Additionally, we analyze the problem of model averaging, in which unimportant features are allowed to alter the final feature selection, by introducing a novel exhaustive feature selection tool that is unaffected by model averaging, and we use it to further evaluate Shapley values for feature selection. Our experiments indicate a small loss in accuracy due to model averaging, while the runtime of Shapley values as a heuristic measure for feature selection is superior for random forests. The fourth challenge is handling gaps in remote sensing data. As we rely on remote sensing data to provide consistent coverage of a small research area, clouds that occlude the satellite's view of the Earth can hide a meaningful amount of data. We approach this challenge by introducing a novel deep interpolation pipeline that uses a U-Net structure together with partial convolutions to gradually fill in remote sensing data in our research area, improving on previously established statistical methods by 44% in terms of RMSE. Lastly, we work towards a solution for making predictions under shifting domains, using regularized transfer learning to transfer knowledge between different domains and improving yield prediction by 16% in terms of RMSE compared to not using transfer learning techniques.},
url = {https://hdl.handle.net/20.500.11811/12560}
}
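
For the first challenge described in the abstract, the distribution-based feature representation fed into gradient boosting can be illustrated with a minimal Python sketch. It assumes per-region collections of remote sensing pixels and uses simple moment and percentile summaries together with scikit-learn's GradientBoostingRegressor; the thesis' exact distribution model, feature set, and hyperparameters are not reproduced here.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def distribution_features(pixels):
    """Summarize the per-band pixel distribution of one region/time step
    with a handful of statistics (a stand-in for the distribution modeling
    described in the abstract; the exact parametrization is assumed)."""
    q = np.percentile(pixels, [10, 25, 50, 75, 90], axis=0)
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0), q.ravel()])

def fit_yield_model(X_raw, y):
    # X_raw: list of arrays, one per region/year, each of shape (n_pixels, n_bands)
    # y: observed yields, shape (n_samples,)
    X = np.stack([distribution_features(p) for p in X_raw])
    model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, max_depth=3)
    model.fit(X, y)
    return model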
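
The grouped Shapley values of the second challenge attribute a prediction to predefined groups of features instead of individual ones. The sketch below is a generic Monte Carlo permutation estimate that treats each group as a single player; it is not the polynomial-time tree algorithm from the thesis, and the function and parameter names are illustrative.

import numpy as np

def grouped_shapley(predict, x, background, groups, n_perm=200, seed=None):
    """Monte Carlo estimate of Shapley values for predefined feature groups.
    Features of groups outside the current coalition keep the values of a
    randomly drawn background sample; adding a group switches its features
    to the values of the instance x being explained."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(len(groups))
    for _ in range(n_perm):
        order = rng.permutation(len(groups))
        z = background[rng.integers(len(background))].copy()
        prev = predict(z[None, :])[0]
        for g in order:
            z[groups[g]] = x[groups[g]]          # add group g to the coalition
            cur = predict(z[None, :])[0]
            phi[g] += cur - prev
            prev = cur
    return phi / n_perm

# groups: list of index arrays, e.g. [np.arange(0, 12), np.arange(12, 24), ...]
# predict: e.g. model.predict of a fitted tree ensemble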
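
For the third challenge, exhaustive feature selection that retrains a fresh model for every subset of groups sidesteps the model averaging problem, because no single model's importance scores decide the selection. The sketch assumes scikit-learn cross-validation with a random forest and RMSE scoring; the thesis' actual tool and evaluation protocol may differ.

import numpy as np
from itertools import combinations
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def exhaustive_group_selection(X, y, groups, cv=5):
    """Evaluate every non-empty subset of feature groups with a freshly
    trained model and keep the subset with the best cross-validated RMSE."""
    best_score, best_subset = -np.inf, ()
    for r in range(1, len(groups) + 1):
        for subset in combinations(range(len(groups)), r):
            cols = np.concatenate([groups[g] for g in subset])
            model = RandomForestRegressor(n_estimators=200, n_jobs=-1)
            score = cross_val_score(model, X[:, cols], y, cv=cv,
                                    scoring="neg_root_mean_squared_error").mean()
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, -best_score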
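
The gap-filling pipeline of the fourth challenge builds on partial convolutions inside a U-Net. Below is a simplified PyTorch partial convolution layer: it convolves only observed pixels, renormalizes by the number of valid inputs per window, and propagates an updated mask. The surrounding U-Net, loss, and training setup from the thesis are omitted, and the layer itself is a minimal sketch rather than the exact implementation used there.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Simplified partial convolution: convolve only valid (unmasked) pixels,
    rescale by the fraction of valid inputs in each window, and update the
    mask so that stacked layers gradually shrink the cloud gaps."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=True)
        # Fixed all-ones kernel used only to count valid pixels per window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.window = kernel_size * kernel_size
        self.padding = padding

    def forward(self, x, mask):                      # mask: 1 = observed, 0 = cloud
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones, padding=self.padding)
        out = self.conv(x * mask)
        bias = self.conv.bias.view(1, -1, 1, 1)
        scale = self.window / valid.clamp(min=1.0)
        out = (out - bias) * scale + bias            # renormalize, keep the bias
        new_mask = (valid > 0).float()
        return out * new_mask, new_mask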
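
Regularized transfer learning for the fifth challenge can, in one common form, penalize how far the target-domain weights drift from a model pretrained on the source domain (an L2-SP-style penalty). The sketch below assumes PyTorch and a stored source-domain state dict; it is not claimed to match the exact regularizer used in the thesis.

import torch

def transfer_penalty(model, source_state, lam=1e-3):
    """L2 penalty pulling target-domain parameters towards the parameters of
    a model trained on the source domain."""
    reg = 0.0
    for name, p in model.named_parameters():
        if name in source_state:
            reg = reg + ((p - source_state[name].detach()) ** 2).sum()
    return lam * reg

# Inside the target-domain training loop (illustrative):
#   loss = mse(model(x_target), y_target) + transfer_penalty(model, source_state)
#   loss.backward(); optimizer.step()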

The following license files are associated with this item:

http://creativecommons.org/licenses/by-nc-nd/4.0/