Document Type: Full Research Paper


1 MSc Student, Bioelectric Department, Biomedical Engineering Faculty, Amirkabir University of Technology, Tehran, Iran

2 Associate Professor, Bioelectric Department, Biomedical Engineering Faculty, Amirkabir University of Technology, Tehran, Iran

3 Assistant Professor, Faculty of Mathematics & Computer Sciences, Amirkabir University of Technology, Tehran, Iran


Genomic nucleotide sequences can be used as biochemical signals in machine learning methods once they are converted into numerical codes. This conversion artificially inflates the dimension of the data and constrains data analysis operations such as visualization and feature extraction. Dimensionality reduction techniques are therefore needed to return the data to its intrinsic dimension. In this study, a deep autoencoder neural network is used to reduce the dimension of binding site sequence data from the human genome. To determine whether the information in the original data is preserved in the compressed data, we perform a two-class classification using a support vector machine. The results show that the information is almost entirely preserved under compression. The compressed data is then used for visualization and for feature selection by analysis of variance. The results show that the first, tenth, and eighth positions in the sequences are the most informative. While most previous works deal with microarray gene expression data and compare only a few dimension reduction algorithms, this paper is the first to use an autoencoder on nucleotide sequence data, and it provides a comprehensive comparison of the performance of dimension reduction techniques and machine learning algorithms.
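The dimension increase mentioned in the abstract can be illustrated with a minimal sketch. It assumes one-hot coding over the alphabet A, C, G, T, which is a common choice but is not stated in the abstract; the function name `one_hot_encode` and the example sequence are hypothetical.

```python
# Assumed coding scheme: one-hot over A, C, G, T (not confirmed by the abstract).
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Map a nucleotide string to a flat binary vector of length 4 * len(sequence)."""
    vector = []
    for base in sequence.upper():
        if base not in ALPHABET:
            raise ValueError(f"unexpected symbol: {base!r}")
        # Each nucleotide becomes four binary values, exactly one of which is 1.
        vector.extend(1 if base == symbol else 0 for symbol in ALPHABET)
    return vector


site = "ACGTTGCA"  # hypothetical 8-nucleotide binding site
encoded = one_hot_encode(site)
print(len(site), "->", len(encoded))  # prints "8 -> 32"
```

Under this assumption each position carries a single categorical value but occupies four coordinates, so the coded dimension is four times the sequence length. This is the kind of inflation that a dimensionality reduction step is meant to undo.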


