Dimensionality Reduction of Binding Site Sequence Data on the Human Genome Using a Deep Autoencoder Neural Network

Bankikoshki, Hossein; Seyyedsalehi, Seyed Ali; Zare Mirakabad, Fatemeh

doi:10.22041/ijbme.2018.75885.1294

Article type: Full research paper

Authors

¹ MSc Student in Biomedical Engineering, Bioelectric Department, Faculty of Biomedical Engineering, Amirkabir University of Technology, Tehran

² Associate Professor, Bioelectric Department, Faculty of Biomedical Engineering, Amirkabir University of Technology, Tehran

³ Assistant Professor, Faculty of Mathematics and Computer Science, Amirkabir University of Technology, Tehran

https://doi.org/10.22041/ijbme.2018.75885.1294

Abstract

Genomic nucleotide sequences can be used as biochemical signals in machine learning methods once they are converted into numerical codes. This conversion artificially inflates the dimension of the data and constrains data analysis operations such as visualization and feature extraction. Dimensionality reduction methods should therefore be used to return the data to its real space. In this study, a deep autoencoder neural network is used to reduce the dimension of sequence data for binding sites on the human genome. To assess how much of the information in the original data is preserved in the reduced data, a two-class classification is performed with a support vector machine. The results show that the information is almost entirely preserved under compression. The compressed data is then used for visualization and for feature selection by analysis of variance. The results show that the first, tenth, and eighth positions in the sequences carry the most information. Whereas most previous studies have focused on microarray gene expression data and offered only a limited comparison of dimensionality reduction methods, this paper for the first time reduces the dimension of nucleotide sequence data with an autoencoder network and provides a comprehensive comparison of dimensionality reduction and machine learning methods.
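The dimension inflation mentioned in the abstract can be illustrated with a small sketch. It assumes one-hot coding, which is a common numerical code for nucleotides; the abstract does not state which code the authors used, and the helper name `one_hot_encode` and the example sequence are hypothetical.

```python
# Assumption: one-hot coding over the alphabet A, C, G, T. The abstract only
# says that sequences are converted to numerical codes.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Map each nucleotide to a 4-element indicator block and concatenate."""
    encoded = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError("unexpected symbol: %r" % base)
        block = [0] * len(NUCLEOTIDES)
        block[NUCLEOTIDES.index(base)] = 1
        encoded.extend(block)
    return encoded


site = "ACGTTGCA"  # made-up 8-nucleotide example, not taken from the paper's data
vector = one_hot_encode(site)
print(len(site), "positions ->", len(vector), "numeric dimensions")
```

Under this assumed coding, every position contributes four numeric dimensions of which exactly one is nonzero, which is the kind of redundancy a dimensionality reduction step can remove.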

Keywords

Subjects

Bioinformatics

Article Title [English]

Dimensionality Reduction of Binding Site Sequence Data on Human Genome Using a Deep Autoencoder Neural Network

Authors [English]

Hossein Bankikoshki ¹
Seyed Ali Seyyedsalehi ²
Fatemeh Zare Mirakabad ³

¹ MSc Student, Bioelectric Department, Biomedical Engineering Faculty, Amirkabir University of Technology, Tehran, Iran

² Associate Professor, Bioelectric Department, Biomedical Engineering Faculty, Amirkabir University of Technology, Tehran, Iran

³ Assistant Professor, Faculty of Mathematics & Computer Sciences, Amirkabir University of Technology, Tehran, Iran

Abstract [English]

Genomic nucleotide sequences can be used as biochemical signals in machine learning methods by converting them into numerical codes. This conversion unrealistically increases the dimension of the data and imposes constraints on data analysis operations such as visualization and feature extraction. Dimensionality reduction techniques should therefore be used to return the data to its real dimension. In this study, a deep autoencoder neural network is used to reduce the dimension of binding site sequence data on the human genome. To determine whether the information in the original data is preserved in the compressed data, we perform a two-class classification using a support vector machine. The results show that the information is almost entirely preserved under compression. The compressed data is then used for visualization and for feature selection by analysis of variance. The results show that the first, tenth, and eighth positions in the sequences are the most informative. While most previous works deal with microarray gene expression data and compare only a few dimensionality reduction algorithms, this paper for the first time applies an autoencoder to nucleotide sequence data and provides a comprehensive comparison of the performance of dimensionality reduction techniques and machine learning algorithms.
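Feature selection by analysis of variance can be sketched as follows. This is a minimal illustration of a one-way ANOVA F statistic computed per feature across two classes. The function name `anova_f` and the toy values are hypothetical, and the abstract does not specify how the authors mapped features back to sequence positions.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one feature, given its values per class."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Toy data (made up): each feature holds its values for class A and class B.
features = [
    ([1.0, 1.2, 0.8], [3.0, 3.1, 2.9]),  # class means differ strongly
    ([1.0, 2.0, 3.0], [1.5, 2.5, 2.0]),  # class means coincide
]
scores = [anova_f(list(f)) for f in features]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

A larger F value means the class means of that feature differ more relative to the within-class spread, so ranking features by F gives one simple ordering of how informative they are for the two-class problem.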

Keywords [English]

Autoencoder
Dimensionality Reduction
Genome Sequence
Classification
Feature Selection


Iranian Journal of Biomedical Engineering


Volume 11, Issue 3
Aban 1396 (October-November 2017)
Pages 219-230
