Computational and Mathematical Methods in Medicine, Volume 2020, 2020-03-28
HMMPred: Accurate Prediction of DNA-Binding Proteins Based on HMM Profiles and XGBoost Feature Selection
Research Article
Xiuzhi Sang 1, Wanyue Xiao 2, Huiwen Zheng 3, Yang Yang 4, Taigang Liu 1
DOI:10.1155/2020/1384749
Received 2020-01-29, accepted for publication 2020-03-16, published 2020-03-28
Abstract

Predicting DNA-binding proteins (DBPs) has become a popular research topic in protein science because DBPs play a crucial role in all aspects of biological activity. Although considerable effort has been devoted to developing powerful computational methods for this problem, it remains a challenging task in bioinformatics. Hidden Markov model (HMM) profiles have been shown to provide important clues for improving DBP prediction. In this paper, we propose a method, called HMMPred, which extracts amino acid composition and auto- and cross-covariance transformation features from HMM profiles to train a machine learning model for identifying DBPs. Feature selection is then performed with the extreme gradient boosting (XGBoost) algorithm. Finally, the selected features are fed into a support vector machine (SVM) classifier to predict DBPs. Experimental results on two benchmark datasets show that the proposed method outperforms most existing methods and could serve as an alternative tool for identifying DBPs.
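The two profile-derived feature groups named in the abstract can be illustrated with a minimal sketch. It assumes the HMM profile has already been parsed into an L x 20 matrix of per-residue scores (one row per residue, one column per amino acid). The function names, the `max_lag` parameter, and the 1/(L - g) normalization are illustrative assumptions based on the commonly used auto- and cross-covariance definition; the exact formulas and parameter values used by HMMPred may differ. The XGBoost selection and SVM steps are not covered here.

```python
from typing import List

# Hypothetical representation: L rows (residues) x 20 columns (amino acids).
Profile = List[List[float]]


def column_means(profile: Profile) -> List[float]:
    """Average each column of the profile over all residues."""
    n_rows = len(profile)
    n_cols = len(profile[0])
    return [sum(row[j] for row in profile) / n_rows for j in range(n_cols)]


def aac_features(profile: Profile) -> List[float]:
    """Composition-style features: one column average per amino acid."""
    return column_means(profile)


def acc_features(profile: Profile, max_lag: int) -> List[float]:
    """Auto- and cross-covariance features for lags 1..max_lag.

    For each lag g and column pair (j1, j2), average the product of the
    mean-centered scores at positions i and i + g. Pairs with j1 == j2
    are the auto-covariance terms; the rest are cross-covariance terms.
    """
    n_rows = len(profile)
    n_cols = len(profile[0])
    if not 1 <= max_lag < n_rows:
        raise ValueError("max_lag must satisfy 1 <= max_lag < sequence length")
    means = column_means(profile)
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in profile]
    features: List[float] = []
    for lag in range(1, max_lag + 1):
        span = n_rows - lag  # number of residue pairs separated by this lag
        for j1 in range(n_cols):
            for j2 in range(n_cols):
                total = sum(
                    centered[i][j1] * centered[i + lag][j2] for i in range(span)
                )
                features.append(total / span)
    return features
```

Concatenating `aac_features(profile)` with `acc_features(profile, max_lag)` gives a fixed-length vector of 20 + 400 x `max_lag` values for a 20-column profile, regardless of sequence length, which is what makes profiles of different lengths comparable by a downstream classifier.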

License

Copyright © 2020 Xiuzhi Sang et al.
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Corresponding authors

1. Yang Yang, School of Information Management, Nanjing University, Nanjing 210023, China (nju.edu.cn); njukyang@hotmail.com
2. Taigang Liu, College of Information, Shanghai Ocean University, Shanghai 201306, China (shou.edu.cn); tgliu@shou.edu.cn

Recommended citation

Xiuzhi Sang, Wanyue Xiao, Huiwen Zheng, Yang Yang, Taigang Liu. HMMPred: Accurate Prediction of DNA-Binding Proteins Based on HMM Profiles and XGBoost Feature Selection. Computational and Mathematical Methods in Medicine, Vol. 2020 (2020).
