AHRD: Automatically Annotate Proteins with Human Readable Descriptions and Gene Ontology Terms

Boecker, Florian

dc.contributor.advisor	Schoof, Heiko
dc.contributor.author	Boecker, Florian
dc.date.accessioned	2021-10-05T14:04:35Z
dc.date.available	2021-10-05T14:04:35Z
dc.date.issued	05.10.2021
dc.identifier.uri	https://hdl.handle.net/20.500.11811/9344
dc.description.abstract	In the postgenomic era, it is impossible to annotate the majority of new proteins in any other way than with computational methods. Our tool AHRD automatically annotates proteins with human readable descriptions and Gene Ontology (GO) terms on a genomic scale. It does so by performing a lexical analysis modeled on the decision process of a human curator investigating the protein descriptions of homologous proteins found by sequence similarity. The central questions of this thesis are how GO annotations can be accurately evaluated and how the annotation performance of AHRD can be increased. To this end, we first generated an unbiased ground truth set of high-quality protein annotations with minimal redundancy. It contains many proteins that are difficult to annotate and thus facilitates contrasting annotation methods. Second, we implemented and tested three evaluation metrics for the congruence of GO term annotations. The third metric, which employs the structure of the Gene Ontology and the commonness of GO terms to determine the semantic similarity of GO annotations, is able to perform the most nuanced and consistent evaluation. In addition to a preexisting simulated annealing-based approach, a genetic algorithm-based machine learning method was implemented that uses the aforementioned evaluation metrics to optimize AHRD's input parameters. Although the genetic algorithm provided only small improvements, they were statistically significant, and parameter optimization proved necessary to achieve optimal annotation performance. In the style of the lexical analysis of candidate descriptions, a new GO term-based analysis for candidate annotations was created. This improved AHRD's GO annotation performance and also enabled the incorporation of new quality indicators, such as GO term information content and annotation evidence codes, which improved the performance further.
It also facilitated the annotation with newly combined sets of GO terms instead of only fixed sets obtained from reference proteins. However, this approach proved not to be viable, as it resulted in a significant regression of annotation performance. Using our evaluation method, we showed that AHRD predicts descriptions and GO annotations better and at greater coverage than most of its competitors. Although AHRD is tailored for application to whole proteomes from all branches of life and for ease of use, it showed satisfactory results in most categories of the CAFA3 challenge, a community-driven evaluation of GO annotation methods that often do not have these benefits. In conclusion, we demonstrated a reliable GO annotation evaluation method and used it to develop AHRD's GO annotation from an afterthought to a mature feature. We showed that AHRD is successful not only at the annotation of descriptions but also at GO terms, while staying applicable in whole-genome projects.	en
dc.language.iso	eng
dc.rights	In Copyright
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/
dc.subject	Protein
dc.subject	Funktionsvorhersage
dc.subject	Genomik
dc.subject	Proteomik
dc.subject	Bioinformatik
dc.subject	Function Prediction
dc.subject	Genomics
dc.subject	Proteomics
dc.subject	Bioinformatics
dc.subject.ddc	004 Informatik
dc.subject.ddc	500 Naturwissenschaften
dc.subject.ddc	570 Biowissenschaften, Biologie
dc.title	AHRD: Automatically Annotate Proteins with Human Readable Descriptions and Gene Ontology Terms
dc.type	Dissertation oder Habilitation
dc.publisher.name	Universitäts- und Landesbibliothek Bonn
dc.publisher.location	Bonn
dc.rights.accessRights	openAccess
dc.identifier.urn	https://nbn-resolving.org/urn:nbn:de:hbz:5-63141
ulbbn.pubtype	Erstveröffentlichung
ulbbnediss.affiliation.name	Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location	Bonn
ulbbnediss.thesis.level	Dissertation
ulbbnediss.dissID	6314
ulbbnediss.date.accepted	14.07.2021
ulbbnediss.institute	Landwirtschaftliche Fakultät : Institut für Nutzpflanzenwissenschaften und Ressourcenschutz (INRES)
ulbbnediss.fakultaet	Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee	Hofmann-Apitius, Martin
ulbbnediss.contributor.orcid	https://orcid.org/0000-0002-0732-6914
ulbbnediss.contributor.gnd	1246050447

Files in this item

Name: 6314.pdf
Size: 1.5MB
Format: PDF

This item appears in the following Collection(s)

E-Dissertationen (4442)
