International Journal of Genomics, Volume 2018, 2018-01-10
Ensemble Methods with Voting Protocols Exhibit Superior Performance for Predicting Cancer Clinical Endpoints and Providing More Complete Coverage of Disease-Related Genes
Research Article
Runyu Jing (1), Yu Liang (2), Yi Ran (3), Shengzhong Feng (1), Yanjie Wei (1), Li He (3)
DOI:10.1155/2018/8124950
Received 2017-07-27, accepted for publication 2017-11-14, published 2017-11-14
Abstract

In genetic data modeling, building predictive models from a limited number of samples, especially when the sample count is well below the number of attributes, is difficult because sequencing platforms detect an enormous number of genes. In addition, many studies use machine learning methods to evaluate genetic datasets and identify potential disease-related genes and drug targets, but to the best of our knowledge, the information associated with the selected gene sets has not been thoroughly elucidated in previous studies. To identify a relatively stable scheme for modeling gene datasets with limited samples and to reveal the information they contain, the present study first evaluated the performance of a series of modeling approaches for predicting clinical endpoints of cancer and then integrated the results using various voting protocols. As a result, we propose a relatively stable scheme that combines a set of methods with an ensemble algorithm. Our findings indicate that ensemble methodologies are more reliable than single machine learning algorithms, both for predicting cancer prognoses and for evaluating gene function. The ensemble methodologies also provide more complete coverage of relevant genes, which can facilitate the exploration of cancer mechanisms and the identification of potential drug targets.
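The abstract does not spell out which voting protocols were used, so the following is only a minimal sketch of one common choice, simple majority voting over the class labels predicted by several models. The function name `majority_vote` and the toy predictions are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label list by simple majority.

    predictions: list of lists, one inner list of labels per model,
    all of the same length (one label per sample).
    This is an assumed protocol, not necessarily the one used in the study.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    # zip(*predictions) yields, for each sample, the labels voted by every model
    for votes in zip(*predictions):
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Toy example: three hypothetical models, four samples, binary endpoint
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Using an odd number of models for a binary endpoint avoids ties; weighted or probability-averaging protocols would need a different combination step.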

License

Copyright © 2018 Runyu Jing et al.
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Figures

Workflow of the whole process. First, the datasets were downloaded from the GDC (Genomic Data Commons) database. Next, the downloaded mRNA and microRNA sequencing data were combined according to the usable information. A t-test was then used to determine the significantly expressed genes. Five selection methods were used to select the cancer-associated genes, and subdatasets were generated according to the ranks. Finally, the prediction results were integrated by a voting protocol. Note that every subdataset was divided into two parts, one for cross-validation and one for independent testing, in a 4 : 1 ratio before variable selection. Only the cross-validation part was used for variable selection and modeling.
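The point of splitting before variable selection is that the independent test samples never influence which genes are chosen. A minimal sketch of that ordering follows, assuming a random 4 : 1 split and using the absolute Welch t statistic as a stand-in ranking; the five selection methods are not named on this page, and the helper names `split_4_to_1`, `welch_t`, and `rank_genes`, as well as the synthetic data, are hypothetical.

```python
import random
from statistics import mean, variance


def split_4_to_1(samples, seed=0):
    """Shuffle and split into a cross-validation part (4/5) and a test part (1/5)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


def welch_t(a, b):
    """Welch's t statistic for two groups of expression values."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error


def rank_genes(cv_samples, n_genes):
    """Rank gene indices by |t|, using only the cross-validation samples."""
    scores = []
    for g in range(n_genes):
        pos = [x[g] for x, y in cv_samples if y == 1]
        neg = [x[g] for x, y in cv_samples if y == 0]
        scores.append((abs(welch_t(pos, neg)), g))
    return [g for _score, g in sorted(scores, reverse=True)]


# Synthetic data: 20 samples, 3 genes; only gene 0 depends on the label
data_rng = random.Random(1)
samples = []
for i in range(20):
    label = i % 2
    expression = [10 * label + data_rng.gauss(0, 1),
                  data_rng.gauss(0, 1),
                  data_rng.gauss(0, 1)]
    samples.append((expression, label))

cv_part, test_part = split_4_to_1(samples)
ranking = rank_genes(cv_part, n_genes=3)  # the test part is never touched here
print(len(cv_part), len(test_part), ranking[0])
```

Subdatasets of different sizes could then be built by taking the top-ranked genes from `ranking`, with the held-out part reserved for the final independent test.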

MCC of the BRCA-mRNA group by the functions class.

MCC of the BRCA-mRNA group by reduced datasets.

MCC of the BRCA-miRNA group by the functions class.

MCC of the BRCA-miRNA group by reduced datasets.

MCC of the KIRC-mRNA group by the functions class.

MCC of the KIRC-mRNA group by reduced datasets.

MCC of the KIRC-miRNA group by the functions class.

MCC of the KIRC-miRNA group by reduced datasets.

MCC of the OV-mRNA group by the functions class.

MCC of the OV-mRNA group by reduced datasets.

MCC of the OV-miRNA group by the functions class.

MCC of the OV-miRNA group by reduced datasets. In the 12 box plots, the line in the box is the median. The lower and upper boundaries of the box are Q1 and Q3, respectively. The ends of the dotted lines are the whiskers. ∗The subdatasets from OV-miRNA have at most 83 microRNAs; thus, the scale "100" for OV means 83.
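The figures above report MCC (Matthews correlation coefficient), which for a binary endpoint is computed from the confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal sketch of that standard definition is given below; the function name `mcc`, the 0/1 label encoding, and the convention of returning 0 when the denominator is zero are assumptions for illustration.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: MCC is taken as 0 when any marginal count is zero
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy example: TP=2, FN=1, TN=2, FP=1 gives (4 - 1) / 9
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
score = mcc(y_true, y_pred)
print(round(score, 4))
```

MCC ranges from −1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect), and unlike plain accuracy it stays informative when the two outcome classes are imbalanced.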

Corresponding authors

1. Yanjie Wei, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China, yj.wei@siat.ac.cn
2. Li He, Biogas Appliance Quality Supervision and Inspection Center, Biogas Institute of Ministry of Agriculture, Chengdu, Sichuan, China, helibiogas@126.com

Recommended citation

Runyu Jing, Yu Liang, Yi Ran, Shengzhong Feng, Yanjie Wei, Li He. Ensemble Methods with Voting Protocols Exhibit Superior Performance for Predicting Cancer Clinical Endpoints and Providing More Complete Coverage of Disease-Related Genes. International Journal of Genomics, Vol. 2018 (2018).

