Article type: Full research paper

Authors

1 Ph.D. Student in Computer Engineering, Department of Computer Engineering, Faculty of Engineering, Yazd University, Yazd, Iran

2 Associate Professor, Department of Computer Engineering, Faculty of Engineering, Yazd University, Yazd, Iran

3 Assistant Professor, Department of Computer Engineering, Faculty of Engineering, Ardakan University, Ardakan, Iran

4 Assistant Professor, Department of Computer Engineering, Faculty of Engineering, Yazd University, Yazd, Iran

Abstract

Feature selection is a data preprocessing step in machine learning and data mining. It is especially important in domains such as microarray analysis in bioinformatics, which faces the problem of high-dimensional data with few samples. Selecting the features (genes) that are effective in diagnosing a disease from microarray data plays an important role in early diagnosis and in finding ways to deal with the disease. Feature selection methods based on information theory, which cover a wide range of feature selection methods, use the concept of entropy to define criteria for the relevance, redundancy and complementarity of features. In this paper, the concept of pure continuity is used instead of entropy to propose a new relevance criterion. In the proposed criterion, to control and reduce redundancy, the relevance of a feature to each class is examined separately, whereas most filter methods measure the value of a feature by its relation to all classes as a whole. This allows the effective features of each class to be identified separately, while shared features can still be identified. Another difficulty in some methods is data discretization. In the proposed method, a homomorphism-based transformation retains the advantages of discretization while avoiding its complexities. To compare the proposed method with a number of related methods, seven cancer microarray datasets are used together with three widely used classifiers: naive Bayes, k-nearest neighbors and support vector machine. Experimental results demonstrate the effectiveness of the proposed method in terms of classification accuracy and the number of selected genes.
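To make the per-class idea concrete, the sketch below scores every gene against each class through a one-vs-rest indicator. This is only an illustration under stated assumptions, not the authors' method: the paper's pure-continuity criterion is not spelled out in this abstract, so mutual information (scikit-learn's mutual_info_classif) stands in as the relevance score, and the function name per_class_relevance and the top_k parameter are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def per_class_relevance(X, y, top_k=10):
    """Score every gene against each class separately.

    X : (n_samples, n_genes) expression matrix
    y : (n_samples,) class labels
    Returns a dict mapping each class label to the indices of its
    top_k highest-scoring genes under a one-vs-rest MI score
    (a stand-in for the paper's pure-continuity criterion).
    """
    selected = {}
    for c in np.unique(y):
        # Binary indicator: does the sample belong to class c?
        y_c = (y == c).astype(int)
        # Relevance of every gene to this single class
        scores = mutual_info_classif(X, y_c, random_state=0)
        selected[c] = np.argsort(scores)[::-1][:top_k]
    return selected
```

Scoring each class separately yields one ranking per class, so genes that matter only for a single cancer subtype are not drowned out by genes that are strong across all classes, which is the motivation stated above.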

Keywords

Article Title [English]

Feature Selection based on Information Theory to Select Effective Genes for Diagnosis of Cancer Subtypes using Microarray Data

Authors [English]

  • Abolfazl Tabatabaei 1
  • Vali Derhami 2
  • Razieh Sheikhpour 3
  • Mohammad-Reza Pajoohan 4

1 Ph.D. Student, Department of Computer Engineering, Faculty of Engineering, Yazd University, Yazd, Iran

2 Associate Professor, Department of Computer Engineering, Faculty of Engineering, Yazd University, Yazd, Iran

3 Assistant Professor, Department of Computer Engineering, Faculty of Engineering, Ardakan University, Ardakan, Iran

4 Assistant Professor, Department of Computer Engineering, Faculty of Engineering, Yazd University, Yazd, Iran

Abstract [English]

Feature selection is a well-known preprocessing technique in machine learning and data mining, and is especially important in bioinformatics microarray analysis, which deals with high-dimension, low-sample-size (HDLSS) data. Identifying the genes responsible for a disease from microarray data plays an important role in understanding the mechanism of the disease and in improving how it is treated. Feature selection methods based on information theory, which cover a wide range of feature selection methods, use the concept of entropy to define criteria for relevance, redundancy and complementarity. In this paper, we propose a new relevance criterion based on the concept of pure continuity rather than entropy. In the proposed method, to control and reduce redundancy, the relevance between a feature and each class is examined separately, whereas most filter methods measure the value of a feature by its relation to all classes as a whole. This approach allows the most effective features (genes) of each class to be identified separately, while common features (genes) can still be identified. Discretization is another challenge in some existing techniques. A homomorphism-based transformation in the proposed method avoids the complexities of discretization while retaining its advantages. Seven cancer microarray datasets and three widely used classifiers (naive Bayes, k-nearest neighbors and support vector machine) are used to compare the proposed method with other relevant methods. The results confirm the effectiveness of the proposed method in terms of classification accuracy and the number of selected genes.
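The comparison protocol in the last two sentences can be sketched as follows, assuming scikit-learn implementations of the three classifiers and cross-validated accuracy as the metric. The function evaluate_selected_genes, the gene_idx argument and the hyperparameter choices are illustrative assumptions; the datasets and the authors' exact experimental setup are not reproduced here.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def evaluate_selected_genes(X, y, gene_idx, cv=5):
    """Cross-validated accuracy of NB, KNN and SVM restricted to
    the selected gene columns of X (the output of a selector)."""
    X_sel = X[:, gene_idx]
    results = {}
    for name, clf in [("NB", GaussianNB()),
                      ("KNN", KNeighborsClassifier(n_neighbors=3)),
                      ("SVM", SVC(kernel="linear"))]:
        acc = cross_val_score(clf, X_sel, y, cv=cv, scoring="accuracy")
        results[name] = (acc.mean(), acc.std())  # mean and spread over folds
    return results
```

Reporting accuracy for a fixed number of selected genes lets one method be compared against another on both evaluation parameters named above: classification accuracy and the size of the selected gene subset.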

Keywords [English]

  • Feature Selection
  • Effective Genes
  • Cancer Diagnosis
  • Microarray Data
  • Machine Learning
  • Classification