Feature Extraction Based on Linear Modeling of the Speech Trajectory Embedded in the Reconstructed Phase Space for Speech Recognition Systems

Article type: Full research paper

Authors

1 Ph.D. candidate in Biomedical Engineering, Bioelectric Department, Faculty of Biomedical Engineering, Amirkabir University of Technology (Iran Polytechnic)

2 Associate Professor, Bioelectric Department, Faculty of Biomedical Engineering, Amirkabir University of Technology (Iran Polytechnic)

DOI: 10.22041/ijbme.2012.13096

Abstract

Recent research shows that the nonlinear and chaotic behavior of the speech signal can be studied in the reconstructed phase space (RPS). Embedding theory based on delay coordinates is a suitable tool for examining speech trajectories in the RPS. To date, the characteristics of speech trajectories have rarely been used in practical speech recognition systems. This paper therefore proposes a new feature extraction method based on the parameters of a linear model obtained with the vector AR (VAR) method. In this method, the filter coefficient matrix and/or the reflection coefficients, obtained by applying VAR to the static and dynamic characteristics of the speech trajectories formed in the RPS, yield a high-dimensional feature vector whose dimension can then be reduced appropriately with linear mapping methods. Results of isolated and continuous phoneme recognition experiments on the FARSDAT speech corpus show that this method performs better than other common time-domain feature extraction methods such as LPC and LPREF.
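As an illustration of the delay-coordinate embedding mentioned in the abstract, the following is a minimal sketch that maps a one-dimensional signal frame to a trajectory in an RPS. The function name `delay_embed`, the embedding dimension `dim=3`, the lag `lag=6`, and the synthetic test tone are assumptions made for this example; the abstract does not state the values used in the paper.

```python
import numpy as np

def delay_embed(signal, dim=3, lag=6):
    """Return points [x[n], x[n+lag], ..., x[n+(dim-1)*lag]] as rows.

    dim and lag are illustrative defaults, not values from the paper.
    """
    x = np.asarray(signal, dtype=float)
    n_points = x.size - (dim - 1) * lag
    if n_points <= 0:
        raise ValueError("signal too short for this dim and lag")
    # Column k holds the signal shifted by k*lag samples.
    return np.column_stack(
        [x[k * lag : k * lag + n_points] for k in range(dim)]
    )

# A synthetic 200 Hz tone sampled at 16 kHz stands in for one speech frame.
frame = np.sin(2 * np.pi * 200 * np.arange(400) / 16000)
trajectory = delay_embed(frame, dim=3, lag=6)
print(trajectory.shape)  # (388, 3)
```

Each row of `trajectory` is one point of the embedded trajectory. Some texts write the delay vector with backward shifts, x[n - k*lag]; that convention reverses the column order but describes the same trajectory.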


Article title [English]

Feature Extraction based on Linear Modeling of Embedded Speech Trajectory in the Reconstructed Phase Space for Speech Recognition System

Authors [English]

  • Yaser Shekofteh 1
  • Farshad Almasganj 2
1 Ph.D. Candidate, Bioelectric Department, Faculty of Biomedical Engineering, Amirkabir University of Technology
2 Associate Professor, Bioelectric Department, Faculty of Biomedical Engineering, Amirkabir University of Technology

Abstract [English]

Recent research shows that the nonlinear and chaotic behavior of the speech signal can be studied in the reconstructed phase space (RPS). The delay embedding theorem is a useful tool for studying embedded speech trajectories in the RPS. Characteristics of the speech trajectories have rarely been used in practical speech recognition systems. Therefore, this paper proposes a new feature extraction (FE) method based on the parameters of vector AR (VAR) analysis of the speech trajectories. In this method, the filter and reflection matrices obtained by applying VAR analysis to the static and dynamic information of the speech trajectory in the RPS yield a high-dimensional feature vector. Different transformation methods are then used to obtain final feature vectors of appropriate dimension. Results of isolated and continuous phoneme recognition on the FARSDAT speech corpus show that the proposed FE method performs better than other time-domain FE methods such as LPC and LPREF.
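To make the VAR step concrete, the sketch below fits a VAR model x[n] = A_1 x[n-1] + ... + A_p x[n-p] + e[n] to a multivariate trajectory by ordinary least squares and flattens the coefficient matrices into a feature vector. This is only an illustration under stated assumptions: the function name `fit_var`, the model order, the least-squares estimator, and the synthetic two-dimensional test process are not taken from the paper, which may use a different estimator. The reflection matrices and the dynamic trajectory information mentioned in the abstract are not computed here, and the input is assumed to be zero-mean.

```python
import numpy as np

def fit_var(trajectory, order=2):
    """Least-squares VAR fit; returns A with shape (order, d, d).

    A[k] multiplies x[n-(k+1)]. Assumes a zero-mean trajectory
    (no intercept term is estimated).
    """
    X = np.asarray(trajectory, dtype=float)
    n, d = X.shape
    if n <= order:
        raise ValueError("trajectory too short for this order")
    targets = X[order:]  # rows x[n] for n >= order
    # Regressor row for time n: [x[n-1], x[n-2], ..., x[n-order]]
    past = np.hstack([X[order - k : n - k] for k in range(1, order + 1)])
    coeffs, *_ = np.linalg.lstsq(past, targets, rcond=None)
    # Row block k of coeffs is the transpose of A[k]; undo the transpose.
    return coeffs.reshape(order, d, d).transpose(0, 2, 1)

# Synthetic stable VAR(1) process standing in for an embedded trajectory.
rng = np.random.default_rng(0)
A_true = np.array([[0.5, 0.2], [-0.3, 0.4]])
X = np.zeros((5000, 2))
for t in range(1, len(X)):
    X[t] = A_true @ X[t - 1] + 0.1 * rng.standard_normal(2)

A_hat = fit_var(X, order=1)
features = A_hat.ravel()  # flattened coefficients as a feature vector
```

For a d-dimensional trajectory and order p, the flattened vector has p·d² entries, which grows quickly with d and p. That growth is consistent with the abstract's remark that a transformation step is needed to bring the feature vector down to an appropriate dimension.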

Keywords [English]

  • Speech Recognition
  • Feature Extraction
  • Reconstructed Phase Space
  • Signal Embedding
  • Linear Prediction
  • Vector AR