
Prediction of Lung Cancer Subtypes with Integrated Parallelized Clustering and Classification on Spark

Lokeswari Venkataramana

Abstract


Data mining techniques are used to extract previously unknown knowledge from microarray gene expression data. Because microarray gene data is large in volume, applying traditional data mining algorithms to it is time-consuming. Microarray gene expression data also contains many genes but very few training instances. Feature selection eliminates the unnecessary genes and retains only the prominent genes for diagnosing a disease. The parallel programming framework Apache Spark, which is well suited to iterative processing, was used to parallelize the data mining algorithms. Parallelized computational methods, namely feature selection, classification, and integrated clustering with classification, were used to reduce the time complexity and improve prediction accuracy. The Chi-Square gene selection algorithm was used to select the optimal number of important genes. Clustering and classification algorithms were first applied separately to the selected set of optimal genes, and the accuracy was calculated. Integrated clustering and classification algorithms were then applied to the same selected genes, and the accuracy was calculated again. The accuracy obtained from the separately applied algorithms was compared with the accuracy obtained from the combined clustering and classification algorithms. For the Lung Cancer data set, the individual algorithms provided better accuracy than the integrated algorithms.
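The Chi-Square gene selection step mentioned above can be illustrated with a minimal, single-machine sketch. This is not the paper's Spark implementation: it assumes expression values are binarized at each gene's median before scoring, and the gene names, class labels, and helper functions (`chi_square_score`, `binarize`, `select_top_genes`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from statistics import median


def chi_square_score(feature_values, labels):
    """Pearson chi-square statistic between a discrete feature and class labels."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    score = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            # Expected count if the feature and the class were independent.
            expected = f_count * c_count / n
            diff = observed.get((f, c), 0) - expected
            score += diff * diff / expected
    return score


def binarize(values):
    """Assumption: discretize expression as above (1) or not above (0) the median."""
    m = median(values)
    return [1 if v > m else 0 for v in values]


def select_top_genes(expression, labels, k):
    """Rank genes by chi-square score and keep the k highest-scoring ones."""
    scores = {
        gene: chi_square_score(binarize(values), labels)
        for gene, values in expression.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (hypothetical): three genes measured on six samples.
expression = {
    "gene_a": [9.1, 8.7, 9.4, 1.2, 0.8, 1.5],  # separates the two classes
    "gene_b": [5.0, 5.2, 4.9, 5.1, 5.3, 4.8],  # nearly constant
    "gene_c": [2.0, 7.5, 2.2, 7.9, 2.1, 7.7],  # varies, but not with the class
}
labels = ["adeno", "adeno", "adeno", "squamous", "squamous", "squamous"]

top_genes = select_top_genes(expression, labels, k=1)
```

In this toy case only `gene_a` tracks the class label, so it receives the highest score. In a parallel setting, each gene's score can be computed independently of the others, which is what makes this step a natural candidate for distribution across workers.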

