Li, Maohui: Efficient Instance Segmentation for Agricultural Applications. - Bonn, 2026. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-89046
@phdthesis{handle:20.500.11811/14019,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-89046},
author = {Li, Maohui},
title = {Efficient Instance Segmentation for Agricultural Applications},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2026,
month = mar,
note = {Precision and smart farming technologies, such as agricultural robots, have emerged as crucial solutions to address the challenges of food security and environmental sustainability. Equipped with advanced computer vision capabilities, these systems can automate tasks such as crop monitoring, pest detection, and yield estimation. However, their deployment is often limited by the high computational and memory demands of the vision models. To enable real-time agricultural robotics, efficient yet high-performing deep learning models are required.
To address this challenge, this thesis investigates techniques for compressing large models into lightweight structures that incur only minor performance degradation. We compress both convolutional neural network (CNN)-based and transformer-based models for dense prediction tasks to develop efficient architectures for agricultural scene understanding. In addition, we explore strategies to enhance the performance of efficient transformers suitable for resource-constrained platforms, such as agricultural robots. Finally, we accelerate video instance segmentation (VIS) by pruning redundant tokens while preserving the ability to identify and track objects in dynamic environments. The main contributions are summarized as follows:
We propose a novel knowledge distillation framework to compress large and complex CNN-based models into smaller and more efficient ones for panoptic segmentation. Unlike prior work that mainly focuses on classification, our approach addresses the underexplored regression tasks in dense prediction. The framework leverages both final predictions and intermediate features as supervisory signals, enabling compact models to achieve improved segmentation performance. We validate the method on diverse datasets, including arable farming and horticultural scenarios, and show that large CNNs can be effectively compressed without significant loss in accuracy.
Building on our distillation framework for CNNs, we extend it to transformer-based models for dense prediction. Due to the structural differences between CNNs and transformers, traditional distillation strategies are not directly applicable. To overcome this, we introduce a query-level distillation scheme that aligns unordered queries between complex and efficient transformers. This allows lightweight transformer models to mimic the behavior of large-scale ones and achieve higher accuracy with reduced complexity.
To further enhance efficiency, we apply distillation to compact transformer-based detectors. Since efficient transformers typically underperform full-scale models, direct distillation often results in degraded performance. We address this by proposing an ensemble knowledge distillation strategy: multiple transformers are first ensembled into a stronger model, which then supervises the training of the efficient structure. This enables lightweight transformer models to better balance speed and accuracy, making them more suitable for deployment on resource-constrained platforms.
Finally, we improve efficiency in video instance segmentation (VIS) by compressing transformer-based models through token reduction. While token pruning has been widely explored in image classification, we adapt it for VIS by selectively removing redundant tokens. Our approach identifies informative tokens at the lowest-resolution feature map and propagates them to subsequent layers, reducing inference cost while maintaining accuracy. By adjusting pruning ratios, we achieve a flexible trade-off between inference speed and segmentation performance.
In conclusion, this thesis presents a set of strategies for compressing CNN- and transformer-based models, focusing on knowledge distillation and token pruning. The proposed methods are extensively evaluated on both agriculture-specific and commonly used datasets, demonstrating their utility in crop monitoring and decision-making tasks. All contributions are supported by peer-reviewed publications, highlighting their potential for real-world deployment in smart farming.},
url = {https://hdl.handle.net/20.500.11811/14019}
}