Boecker, Florian: AHRD: Automatically Annotate Proteins with Human Readable Descriptions and Gene Ontology Terms. - Bonn, 2021. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-63141
@phdthesis{handle:20.500.11811/9344,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-63141},
author = {Boecker, Florian},
title = {AHRD: Automatically Annotate Proteins with Human Readable Descriptions and Gene Ontology Terms},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2021,
month = oct,
note = {In the postgenomic era, it is impossible to annotate the majority of new proteins in any other way than with computational methods. Our tool AHRD automatically annotates proteins with human-readable descriptions and Gene Ontology (GO) terms on a genomic scale. It does so by performing a lexical analysis modeled on the decision process of a human curator who investigates the protein descriptions of homologous proteins found by sequence similarity.
The central questions of this thesis are how GO annotations can be accurately evaluated and how the annotation performance of AHRD can be increased.
To this end, we first generated an unbiased ground truth set of high-quality protein annotations with minimal redundancy. It contains many proteins that are difficult to annotate and thus facilitates contrasting annotation methods. Second, we implemented and tested three evaluation metrics for the congruence of GO term annotations. The third metric, which uses the structure of the Gene Ontology and the commonness of GO terms to determine the semantic similarity of GO annotations, performs the most nuanced and consistent evaluation. In addition to a preexisting simulated annealing-based approach, a genetic algorithm-based machine learning method was implemented that uses the aforementioned evaluation metrics to optimize AHRD's input parameters. Although the genetic algorithm provided only small improvements, they were statistically significant, and parameter optimization proved necessary to achieve optimal annotation performance. In the style of the lexical analysis of candidate descriptions, a new GO term-based analysis for candidate annotations was created. This improved AHRD's GO annotation performance and also enabled the incorporation of new quality indicators, such as GO term information content and annotation evidence codes, which improved the performance further. It also facilitated annotation with newly combined sets of GO terms instead of only fixed sets obtained from reference proteins. However, this approach proved not to be viable, as it resulted in a significant regression of annotation performance. Using our evaluation method, we were able to show that AHRD predicts descriptions and GO annotations better and at a greater coverage than most of its competitors.
AHRD is tailored for application to whole proteomes from all branches of life and for ease of use. Nevertheless, in the CAFA3 challenge, a community-driven evaluation of GO annotation methods that often do not share these benefits, AHRD achieved satisfactory results in most categories.
In conclusion, we demonstrated a reliable GO annotation evaluation method and used it to develop AHRD's GO annotation from an afterthought into a mature feature. We showed that AHRD is successful not only at annotating descriptions but also at annotating GO terms, while remaining applicable to whole-genome projects.},
url = {https://hdl.handle.net/20.500.11811/9344}
}