Speech processing
Mohammad Bahador Najafi; Mansour Vali
Volume 14, Issue 2, July 2020, Pages 97-107
Abstract
After Alzheimer's disease, Parkinson's disease is the most common neurodegenerative disorder of the nervous system. One of its common complications is the development of speech disorders. Since human speech is produced by the combination of vocal-fold vibration (the phonatory section) and the subsequent passage of sound through the resonators of the vocal tract (the articulatory section), both sections are expected to be impaired. In this study, a noninvasive method is used to diagnose Parkinson's disease from each subject's speech signal. For this purpose, three sustained Persian vowels recorded from 48 people (27 with Parkinson's disease and 21 healthy) were used to assess the extent of damage to both the phonatory and articulatory sections. The phonatory model includes features such as jitter, shimmer, fundamental frequency, and the opening and closing cycle times of the glottal pulses. For the articulatory section, features such as the first, second, and third formants, zero-crossing rate, MFCCs, and LPC coefficients are investigated. In total, 38 feature categories were extracted, and four statistical parameters (mean, standard deviation, skewness, and kurtosis) were calculated for each. A genetic algorithm was used to identify the optimal features, which were then fed to SVM, KNN, and decision-tree classifiers to determine whether a person is a patient or healthy. Finally, for the main aim of this study, the results of the phonatory and articulatory sections were compared. The results showed that phonatory features, with an accuracy of 96.1±1.2%, were more useful than articulatory features in diagnosing Parkinson's disease. It was also shown that the vowel /u/ plays a more significant role in the diagnosis of Parkinson's disease than the other vowels, with an accuracy of 97.6%.
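The jitter and shimmer features used above have compact local definitions: jitter is the mean absolute difference between consecutive pitch periods relative to the mean period, and shimmer is the analogous quantity computed on cycle peak amplitudes. A minimal sketch (not code from the paper; the function name and the period/amplitude inputs are illustrative assumptions):

```python
import numpy as np

def jitter_shimmer(periods, amplitudes):
    """Local jitter and shimmer: mean absolute consecutive difference,
    normalized by the mean of the sequence."""
    periods = np.asarray(periods, dtype=float)
    amplitudes = np.asarray(amplitudes, dtype=float)
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    shimmer = np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
    return jitter, shimmer
```

A perfectly periodic voice gives zero jitter and shimmer; Parkinsonian dysphonia typically raises both.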
Speech processing
Hamid Azadi; Mohammad Ali Khalil Zade; Mohammad Reza Akbarzade Toutounchi; Hamid Reza Kobravi; Fariborz Rezaei Talab; Seyed Amir Ziafati Bagherzade; Alireza Noei Sarcheshme; Nina Shahsavan Pour
Volume 10, Issue 1, May 2016, Pages 41-47
Abstract
In recent years, researchers have worked intensively to diagnose Parkinson's disease by finding its relation to the patient's speech signal. Many studies have also examined the severity of the disease and its relation to vocal impairment measures. In this paper, we assess and compare the ability of different feature sets extracted from the speech signal to diagnose Parkinson's disease. To this end, 132 features measuring vocal impairment were computed from the voice signals of individuals vocalizing the phoneme /a/. We then used the RELIEF feature-selection method together with a Support Vector Machine (SVM) classifier to choose the best features of each class. A comparison was made between the different feature sets, and a discrimination rate of 95.93% was finally reached in separating patients from healthy subjects using the combination of selected features. The results of this research can be an important step toward diagnosing Parkinson's disease noninvasively.
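RELIEF scores each feature by how well it separates a sample from its nearest neighbor of the opposite class (nearest miss) versus its nearest neighbor of the same class (nearest hit). A minimal binary-RELIEF sketch, illustrative only and not the authors' implementation:

```python
import numpy as np

def relief_weights(X, y):
    """Binary RELIEF: w_f grows when feature f differs more toward the
    nearest miss than toward the nearest hit, averaged over all samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span          # scale features to [0, 1]
    w = np.zeros(d)
    for i in range(n):
        diff = np.abs(Xs - Xs[i])            # per-feature distances to sample i
        dist = diff.sum(axis=1)              # L1 distance
        dist[i] = np.inf                     # exclude the sample itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        w += diff[miss] - diff[hit]
    return w / n
```

Features with the highest weights would then be passed to the SVM classifier.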
Speech processing
Shahla Azizi; Farzad Towhidkhah; Farshad Almasganj
Volume 6, Issue 4, June 2012, Pages 257-265
Abstract
In the present work, isolated-word recognition has been studied. The purpose of this research is to increase the performance of a children's speech recognizer using Vocal Tract Length Normalization (VTLN). The recognition system was created for a speech-therapy software application whose goals are to recognize correct and incorrect pronunciation and to help children improve it through feedback. In the test phase, speech data related to correct and incorrect pronunciations of 47 words were used. Four baseline models were trained: one for children, one combined model (females and children), and two for adults (exploiting a Persian database). The children's model was trained and tested with data collected from 38 children (5 to 8 years old). The experiments were implemented with the HTK toolkit. The initially poor performance was improved using VTLN, and the improvement for the adult models was larger than for the children's model.
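VTLN is often implemented as a piecewise-linear warp of the frequency axis: frequencies below a cutoff are scaled by a speaker-specific factor alpha, and a second linear segment keeps the Nyquist frequency fixed. A minimal sketch of such a warp applied to a magnitude spectrum (the function name, cutoff, and parameters are assumptions for illustration, not the paper's setup):

```python
import numpy as np

def vtln_warp(spectrum, alpha, f_cut=0.85):
    """Piecewise-linear VTLN warp of a magnitude spectrum.
    Frequencies (as fractions of Nyquist) below f_cut are scaled by alpha;
    above f_cut, a linear segment maps the remainder so that the Nyquist
    endpoint stays fixed."""
    n = len(spectrum)
    f = np.linspace(0.0, 1.0, n)              # normalized frequency axis
    warped = np.where(
        f <= f_cut,
        alpha * f,
        alpha * f_cut + (1.0 - alpha * f_cut) * (f - f_cut) / (1.0 - f_cut),
    )
    # sample the original spectrum at the warped frequency positions
    return np.interp(warped, f, spectrum)
```

Children's shorter vocal tracts shift formants upward, so warping their spectra with alpha < 1 moves them closer to adult-trained model statistics.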
Speech processing
Yaser Shekofteh; Farshad Almasganj
Volume 6, Issue 1, June 2012, Pages 17-33
Abstract
Recent research shows that the nonlinear and chaotic behavior of the speech signal can be studied in the reconstructed phase space (RPS). The delay embedding theorem is a useful tool for studying embedded speech trajectories in the RPS, yet characteristics of these trajectories have rarely been used in practical speech recognition systems. Therefore, in this paper, a new feature extraction (FE) method is proposed based on the parameters of vector autoregressive (VAR) analysis of the speech trajectories. In this method, a high-dimensional feature vector is obtained from the filter and reflection matrices produced by applying VAR analysis to the static and dynamic information of the speech trajectory in the RPS. Different transformation methods are then used to obtain final feature vectors of appropriate dimension. Results of discrete and continuous phoneme recognition on the FARSDAT speech corpus show that the proposed FE method is more efficient than other time-domain FE methods such as LPC and LPREF.
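The delay embedding underlying the RPS maps a scalar signal x(t) to vectors [x(t), x(t+τ), ..., x(t+(m-1)τ)] for an embedding dimension m and lag τ. A minimal sketch of building such a trajectory matrix (illustrative; m and τ here are arbitrary, not the paper's settings):

```python
import numpy as np

def delay_embed(x, m=3, tau=5):
    """Return the RPS trajectory matrix: row t is
    [x(t), x(t+tau), ..., x(t+(m-1)*tau)]."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * tau          # number of complete embedding vectors
    return np.column_stack([x[i * tau : i * tau + n] for i in range(m)])
```

The VAR analysis in the paper is then fitted over the rows of this matrix rather than over the raw scalar samples.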
Speech processing
Ehsan Akafi; Mansour Vali; Negin Moradi
Volume 6, Issue 3, June 2012, Pages 119-129
Abstract
Hypernasality is a frequently occurring resonance disorder in children with cleft palate. An operation is generally necessary to reduce hypernasality, so an assessment of hypernasality is imperative to quantify the effect of the surgery and to design the speech-therapy sessions that are crucial afterwards. In this study, a new quantitative method is proposed to estimate hypernasality. The proposed method exploits the fact that an autoregressive (AR) model of the vocal-tract system of a patient with hypernasal speech is inaccurate, because zeros appear in the frequency response of the vocal tract due to the extra channel between the oral and nasal cavities of these patients. Hypernasality is therefore estimated by a quantity computed from the distance between the sequences of cepstrum coefficients extracted from an AR model and from an autoregressive moving-average (ARMA) model. K-means clustering and Bayes' theorem were used to find a threshold value for the proposed index to classify the subjects' utterances. We achieved balanced accuracies of up to 82.18% on utterances and 97.72% on subjects. Since the proposed method needs only computer processing of speech data, it provides a simpler evaluation of hypernasality than other clinical methods.
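The cepstrum of an all-pole (AR) model can be computed directly from its coefficients by a standard recursion, which is one way to obtain the cepstral sequences whose distance the proposed index compares. A sketch of that recursion for A(z) = 1 + a1·z⁻¹ + a2·z⁻² + ... (illustrative only; the paper's exact AR/ARMA cepstrum computation may differ):

```python
import numpy as np

def ar_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n of the all-pole model 1/A(z),
    A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ..., via the LPC-cepstrum recursion:
    c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = -(a[n - 1] if n <= p else 0.0)
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```

For a hypernasal speaker, the ARMA model captures the nasal zeros that the AR model cannot, so the two cepstral sequences diverge and their distance grows.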
Speech processing
Ayoub Daliri; Farzad Towhidkhah; Shahriar Gharibzadeh; Yaser Shekofteh
Volume 2, Issue 2, June 2008, Pages 123-129
Abstract
Speech production is one of the most complicated physiological systems, comprising several subsystems that must work together synchronously. One of the important subsystems is the jaw. Although various models have been suggested for the jaw, no suitable model has yet been proposed that considers the interactions between muscles, bones, and the nervous system. In this paper, using spring-damper-mass elements and a nonlinear formulation, we introduce a novel model of jaw movement during speech production. Experimental data were used to estimate the model parameters. Computer simulation results showed that the model can generate jaw-movement patterns similar to those observed physiologically. Generality and simplicity are two features of the model that make it useful for further investigation of jaw movement in different tasks.
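At its core, a spring-damper-mass element drives the jaw position x toward a neural target u through the second-order dynamics m·x'' + c·x' + k·(x − u) = 0. A minimal linear simulation of one such element (illustrative parameter values, not the paper's fitted ones, and omitting the paper's nonlinear terms):

```python
import numpy as np

def simulate_jaw(target, m=1.0, c=8.0, k=16.0, dt=0.001, steps=2000):
    """Integrate m*x'' + c*x' + k*(x - target) = 0 with forward Euler.
    c = 2*sqrt(m*k) gives critical damping, so x approaches the target
    smoothly without oscillation."""
    x, v = 0.0, 0.0
    traj = np.empty(steps)
    for t in range(steps):
        a = (k * (target - x) - c * v) / m   # acceleration from spring and damper
        v += a * dt
        x += v * dt
        traj[t] = x
    return traj
```

Sequencing targets over time would produce the smooth opening-closing jaw trajectories of running speech.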
Speech processing
Mohammad Reza Yazdchi; Seyed Ali Seyed Salehi
Volume 1, Issue 3, June 2007, Pages 201-213
Abstract
One of the most important challenges in automatic speech recognition (ASR) is the mismatch between training and testing data. To decrease this mismatch, conventional methods try to enhance the speech or adapt the statistical models; training the model under different conditions is another such approach. The success of these methods still seems primitive compared with the cognitive and recognition abilities of human beings. In this paper, inspired by the human recognition system, we developed and implemented a new connectionist lexical model. The integration of imputation and classification in a single neural network for ASR with missing data was investigated; this can be considered a variant of multi-task learning, because the imputation and classification tasks are trained in parallel. Cascading this model with the acoustic model corrects the phoneme sequence produced by the acoustic model toward the desired sequence. The approach was implemented on 400 isolated words of the TFARSDAT database (a real telephone database). In the best case, phoneme recognition correctness increased by 16.9 percent. Incorporating prior (high-level) knowledge into acoustic-phonetic (lower-level) information can improve recognition. By cascading the lexical model with the acoustic model, the feature parameters were corrected using neural-network inversion techniques. Speech enhancement by this method had a remarkable effect on the mismatch between training and testing data; the efficiency of the lexical model and speech enhancement was demonstrated by an 18 percent improvement in phoneme recognition correctness compared with the acoustic model alone.
Speech processing
Mansour Sheykhan
Volume 1, Issue 3, June 2007, Pages 227-240
Abstract
In the first version of our Farsi Text-To-Speech (TTS) system, a Recurrent Neural Network (RNN) was used to generate prosody parameters (pitch contour, duration, energy, and pause), and a Harmonic plus Noise Model (HNM) synthesizer was used to concatenate single diphone units. To improve the performance of the TTS system, two modifications are presented in this paper. The first is a neural-statistical hybrid model in which the RNN acts as the prosody parameterizer, and a combination of decision trees and Gaussian Mixture Models (GMMs) provides the probability distributions of targets and transitions in each context-equivalent cluster. The second modification is the development of a unit-selection synthesizer in which the syllable is the basic synthesis unit; owing to the first modification, an effective unit-selection strategy is also obtained. To evaluate the performance of the system, the rating scales of ITU Recommendation P.85 were used, and a Mean Opinion Score (MOS) of 3.6 was achieved over six scales.