Computational annotation of protein function: what has been happening recently?

- October 20, 2023

Proteins are fundamental to life and carry out most vital cellular functions, so annotating their functions is essential for understanding life processes at the molecular level. Machine learning and deep learning, already widely applied to biological sequence analysis, protein structure prediction, and medical image processing, underpin many published computational methods for this task. Machine learning methods typically combine features extracted from diverse data sources to score the similarity between proteins and functional terms, so that similar proteins receive similar functional annotations. Deep learning models typically extract protein sequence features using convolutional and recurrent neural networks, then incorporate sequence similarity, protein-protein interaction (PPI) network data, and other information to improve performance. Several such methods are listed in Table 1.

Table 1: Some well-established methods for protein function annotation. The upper row lists deep learning-based methods; the lower row lists machine learning-based methods.

DeepGO [1]	DeepGOPlus [2]	deepNF [3]	DeepMNE [4]	DeepGOA [5]	HNetGO [6]
GeneMANIA [7]	MS-kNN [8]	NetGO [9]

One recent protein annotation method is HNetGO [6]. It consists of three stages: i) construction of a heterogeneous network, ii) a pretrained model that extracts protein-level sequence features, and iii) a link prediction step that predicts protein function.

The first stage uses human and mouse protein sequences from UniProt and protein-protein interaction data from STRING. Gene Ontology (GO) information is taken from UniProt and the GO website.

The heterogeneous network is constructed from information extracted from the GO and UniProt databases. Construction includes a filtering step applied within each of the three GO categories: “Biological Process”, “Molecular Function”, and “Cellular Component”.
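As a rough illustration of what a heterogeneous network of proteins and GO terms can look like, the sketch below stores typed edges in plain dictionaries and keeps only the terms from one GO category. It is not the HNetGO implementation: the edge types (`interacts`, `annotated_with`, `is_a`), the function name `build_hetero_graph`, and all identifiers in the toy data are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy inputs; real data would come from UniProt, STRING, and GO.
ppi_edges = [("P1", "P2"), ("P2", "P3")]        # protein-protein interactions
annotations = [("P1", "GO:A"), ("P2", "GO:B"), ("P3", "GO:C")]  # protein -> term
go_is_a = [("GO:B", "GO:A")]                    # child term -> parent term
go_namespace = {
    "GO:A": "molecular_function",
    "GO:B": "molecular_function",
    "GO:C": "cellular_component",
}


def build_hetero_graph(namespace):
    """Return {edge_type: {source: set(targets)}} restricted to one GO category."""
    keep = {term for term, ns in go_namespace.items() if ns == namespace}
    graph = {
        edge_type: defaultdict(set)
        for edge_type in ("interacts", "annotated_with", "is_a")
    }
    for a, b in ppi_edges:
        # Interactions are undirected, so store both directions.
        graph["interacts"][a].add(b)
        graph["interacts"][b].add(a)
    for protein, term in annotations:
        if term in keep:  # drop annotations from the other categories
            graph["annotated_with"][protein].add(term)
    for child, parent in go_is_a:
        if child in keep and parent in keep:
            graph["is_a"][child].add(parent)
    return graph


mf_graph = build_hetero_graph("molecular_function")
```

Building one graph per category, as done here, is only one possible reading of the filtering step; the point is that protein nodes and term nodes live in the same graph and are connected by edges of different types.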

To extract protein-level features, the method uses SeqVec, which captures long-range dependencies in the protein sequence and generates an embedding vector for each amino acid. Averaging these vectors over the sequence then yields a protein-level semantic representation directly.
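The average aggregation step can be sketched in a few lines. The example below assumes the per-residue embeddings are already available as a list of equal-length vectors; it does not call SeqVec itself, and the function name `mean_pool` and the 4-dimensional toy vectors are placeholders chosen for illustration.

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length protein vector."""
    if not residue_embeddings:
        raise ValueError("cannot pool an empty sequence")
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [
        sum(vec[i] for vec in residue_embeddings) / length
        for i in range(dim)
    ]


# Toy stand-in for an embedder's output: 3 residues, 4 dimensions each.
toy_embeddings = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
protein_vector = mean_pool(toy_embeddings)
print(protein_vector)  # [2.0, 1.0, 2.0, 2.0]
```

Because the average is taken over the sequence length, proteins of any length map to a vector of the same dimension, which is what makes the result usable as a node feature.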

Finally, in the prediction stage, the HNetGO model relies on the neighborhood information of each node, so the subgraph containing a given protein node can be extracted by neighborhood sampling and used for function prediction.
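Neighborhood sampling in general can be sketched as a bounded breadth-first expansion from the query node. The example below is a generic k-hop sampler with a per-node fan-out limit, not HNetGO's actual sampler; the function name `sample_subgraph`, its parameters, and the toy adjacency map are all assumptions for illustration.

```python
import random


def sample_subgraph(adjacency, seed_node, num_hops=2, fanout=2, rng=None):
    """Sample up to `fanout` neighbors per node for `num_hops` hops.

    Returns the set of sampled nodes and the list of sampled edges.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    visited = {seed_node}
    frontier = [seed_node]
    edges = []
    for _ in range(num_hops):
        next_frontier = []
        for node in frontier:
            neighbors = sorted(adjacency.get(node, ()))
            picked = rng.sample(neighbors, min(fanout, len(neighbors)))
            for neighbor in picked:
                edges.append((node, neighbor))
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited, edges


# Hypothetical toy graph mixing protein nodes and one GO term node.
toy_adjacency = {
    "P1": {"P2", "GO:A"},
    "P2": {"P1", "P3"},
    "P3": {"P2"},
    "GO:A": {"P1"},
}
nodes, sampled_edges = sample_subgraph(toy_adjacency, "P1", num_hops=2, fanout=5)
```

Capping the fan-out keeps the extracted subgraph small even around highly connected nodes, so prediction for one protein does not require loading the whole network.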

Although its methodology improves on that of other methods, HNetGO is limited by a higher false positive rate. Nevertheless, it outperforms its counterparts, and it would be no exaggeration to say that increasing the model complexity and fully mining the GO database could make it a promising candidate for protein function prediction.

References:

1. Kulmanov M, Khan MA, Hoehndorf R. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics. 2018;34: 660–668.

2. Kulmanov M, Hoehndorf R. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics. 2021;37: 1187.

3. Gligorijevic V, Barot M, Bonneau R. deepNF: deep network fusion for protein function prediction. Bioinformatics. 2018;34: 3873–3881.

4. Ma Y. DeepMNE: Deep Multi-Network Embedding for lncRNA-Disease Association Prediction. IEEE J Biomed Health Inform. 2022;26: 3539–3549.

5. Zhou G, Wang J, Zhang X, Yu G. DeepGOA: Predicting gene ontology annotations of proteins via graph convolutional network. 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2019. doi:10.1109/bibm47256.2019.8983075

6. Zhang X, Guo H, Zhang F, Wang X, Wu K, Qiu S, et al. HNetGO: protein function prediction via heterogeneous network transformer. Brief Bioinform. 2023;24: bbab556.

7. Franz M, Rodriguez H, Lopes C, Zuberi K, Montojo J, Bader GD, et al. GeneMANIA update 2018. Nucleic Acids Res. 2018;46: W60–W64.

8. Lan L, Djuric N, Guo Y, Vucetic S. MS-kNN: protein function prediction by integrating multiple data sources. BMC Bioinformatics. 2013;14 Suppl 3: S8.

9. You R, Yao S, Xiong Y, Huang X, Sun F, Mamitsuka H, et al. NetGO: improving large-scale protein function prediction with massive network information. Nucleic Acids Res. 2019;47: W379–W387.