Since speech is an integral part of communication, people with speech disorders are at risk of being isolated in society. Our work deals with articulation disorder, which is one of the more frequently occurring speech disorders. Difficulty in correctly pronouncing phonemes is referred to as articulation disorder. Most such pronunciation errors made by children or adults can be corrected with individual training given by speech therapists. With today’s growing population, speech therapists need the help of signal processing techniques to attend to the needs of an increasing number of patients. Until now, signal processing techniques have not been implemented in Malayalam for the evaluation of speech disorders in order to reduce the effort of the speech pathologist. As a preliminary work towards automation of the Malayalam articulation test, in this work we investigate substitution-type articulation disorder with waveform analysis. We then introduce an objective evaluation of articulation disorder that builds on speech processing techniques such as dynamic time warping (DTW). To quantify articulation disorder, objective speech quality measures such as Itakura-Saito (IS), log likelihood ratio (LLR), log area ratio (LAR), weighted cepstrum distance (WCD), signal-to-noise ratio (SNR) etc. are computed between normal and disordered speech. Finally, these objective measures are combined using a majority voting method to obtain a score. The false rejection ratio calculated for these score values shows that the objective measures correlate well with subjective evaluation results.

The four dimensions of speech, the basic form of communication, are voice, resonance, articulation and fluency. Abnormality in any of these areas results in speech disorders such as voice disorders, fluency disorders and articulation disorders.
A speaker is said to have an articulation disorder if he fails to perceive a significant contrast between the standard phoneme and the phoneme he produces.

As per World Health Organization (WHO) statistics, at least 3.5% of the human population are victims of speech disorder [2]. The Census of India 2011 shows that about 50,72,914 Indians have speech disorders; in Kerala the figure is 1,05,366 [3]. The WHO report on disability in the South-East Asia region (2013) indicates that in India speech and hearing impairment is the third-ranked impairment [4]. Also, as is the case in several developing countries, people with speech disorders in India are much less likely to receive assistive devices than people with other impairments [5].

As one of the most commonly occurring speech disorders, articulation disorder can be corrected with proper training. The causes of articulation disorder can be biological or environmental. Both children and adults can suffer from this speech sound disorder. The difficulties associated with articulation disorder can be categorized as omission (“pepa” for “pepper”), substitution (“thoda” for “soda”) and distortion (“shlip” for “ship”). In this work we mainly concentrate on the substitution type of articulation disorder in adults.

Familial aggregation of speech sound disorders is a problem faced by today’s nuclear families. Children of affected parents are more likely to have articulation disorders [6]. Being around more people helps a child to develop language because the child hears more people talk, which is missing in today’s nuclear families. In this scenario, analysing the speech disorders of adults and of children is equally relevant.

Traditional speech therapy for resolving speech problems involves one-on-one or group lessons with speech-language pathologists (SLPs), which are time consuming and costly. For articulation disorder, therapy must be done frequently to remain effective, which is difficult to achieve for many reasons, including the shortage of SLPs and financial limitations.
Nowadays we require the support of advanced technologies to meet the needs of an increasing number of patients and to reduce the cost of treatment. Existing systems use technology as an aid to identify and assess the degree of disorder. For example, Computerized Assessment of Phonological Processes in Malayalam (CAPP-M) is a software package developed at AIISH, Mysore for speech therapy [7]. In CAPP-M, the speech-language pathologist shows pictures to the client and, just by listening, marks the corresponding utterance produced in order to grade the disability. This is a time-consuming process. It requires human intervention and is hence semi-automatic in nature. If instead a system could automatically recognize the speech uttered by the client and decide which phoneme has the articulation disorder, the human effort and the associated errors could be reduced. Another example is the Vagmi Therapy Picture-Word-Articulation Module, which also aims to provide computerized assessment of misarticulation. This module is currently available in English, Kannada, Telugu, Hindi, Oriya and Arabic. Until now, signal processing techniques have not been fully implemented in Malayalam for the evaluation of speech disorders in order to reduce the effort of the speech pathologist.

Various signal processing techniques proposed in the literature suggest that automated evaluation using signal features such as the Teager Energy Operator (TEO), Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC), pitch, jitter, shimmer and the first three formants together with the bandwidth of the first formant [8] is effective. Such a system requires training and testing on various speech recordings, including both normal and disordered speech, to produce a result. Subjective evaluation techniques, by contrast, require skilled and trained personnel as well as several man-hours of effort.
Although subjective evaluation is efficient and accurate, the effort shouldered by these personnel may be reduced by using objective evaluation techniques that make use of objective measures such as the Itakura-Saito (IS) measure, log likelihood ratio (LLR), log area ratio (LAR), segmental SNR measure and log cepstral distance (LCD) [1].

In order to automate the articulation disorder test, we first have to identify the phoneme at which the disorder occurs. SLPs do this manually by repeatedly listening to the patient. As a preliminary work, we perform waveform analysis of the disordered speech and identify the disordered phoneme by comparing it with the normal speech signal. After identifying the disordered phoneme, the degree of disorder can be calculated using objective measures. Since the duration of a speech signal differs from one occasion to another, even for the same utterance by the same speaker, we need non-linear matching of the disordered phoneme and the normal phoneme by dynamic time warping (DTW). After finding the optimum alignment between these two phonemes using DTW, we can calculate the distance between normal and disordered speech using objective quality measures such as LAR, LLR, IS, SNR and LCD. To obtain a final score for a particular phoneme, we combine all the objective measures using a majority voting method. Score validation is done by calculating the false rejection rate (FRR).

The speech database used in this experiment was collected locally in a lab environment from adult speakers. This work mainly studies the misarticulation of three Malayalam letters: /bha/, /zha/ and /nja/. For each case, a normal as well as a disordered speech database was collected. For the first disorder (/pha/ for /bha/), eleven normal speech samples and eleven disordered samples were collected. For the second (/ra/ or /la/ for /zha/) there were six of each, and for the third (/na/ for /nja/) five of each. All the speakers were requested to utter the corresponding test words three times.
The recording was done using the free software Wavesurfer at a sampling rate of 16000 Hz in mono channel format.

Figure 1 shows the waveform analysis of normal and disordered speech for the letter /bha/. The misarticulation detected in this case is uttering /pha/ instead of /bha/. This problem is identified as a regional misarticulation found in certain natives of Kottayam district in Kerala. The figure shows waveforms for the correct word “bharatham” (top) and the mispronounced word “pharatham” (bottom). The phoneme /bh/ is a plosive, produced with a short puff of air, and is easily identifiable in an audio waveform. The mispronounced phoneme /ph/, however, is like a fricative and hence not easy to isolate from the waveform. In that case we listen to the recording and try to isolate the particular phoneme.

The second articulation disorder is observed for the letter /zha/, which is more significant because of its usage and pronunciation. This phoneme exists in the Vedic language, which is the source of Sanskrit. Many people do not properly pronounce the letter /zha/, which is unique to Tamil and Malayalam. Words with this sound are converted into easier-to-pronounce sounds such as /ya/, /la/ and /ra/. One possible reason is outside influence. When non-resident Keralites or non-Keralites try to speak the letter, they simplify it by substituting other, more easily uttered letters. For example, non-natives pronounce “vazhappazham” as “valappalam” or “varapparam” and “Kozhikode” as “Koyikode”. Waveforms for this disorder are shown in Figure 2; the top one is for “vazhappazham” and the bottom one is for “varapparam”. In this case the correct and disordered sounds are not identifiable by observation, so we have to listen to the recording and isolate the corresponding letters.

Another misarticulation identified is pronouncing the letter /nja/ as /na/. This problem is found in some adults independent of regional or non-native background. They utter “oonjaal” as “oonaal”.
The waveform analysis for this case is shown in Figure 3. The top speech signal is the correct pronunciation and the bottom one is the mispronunciation.

Suitable features (such as MFCC and LPC) are extracted from the normal as well as the disordered speech. Matching the features of corresponding phones in normal and disordered speech using DTW then gives a measure of similarity. Using this matching, the phone boundaries that delimit the articulation disorder can be identified. Then, using objective speech quality measures such as IS, LLR and LAR, the articulation error can be identified and a score can be computed. The objective scores are then compared with manual evaluation scores to validate their effectiveness.

1) Dynamic Time Warping: Since the normal speech and the disordered speech do not have exactly the same length, a simple one-to-one comparison of windows from each utterance is not possible. Therefore, in this work we use DTW, which is the most straightforward solution for aligning two time sequences of different lengths.

Given two speech patterns, X and Y, these patterns can be represented by the sequences $(x_1, x_2, \ldots, x_{T_x})$ and $(y_1, y_2, \ldots, y_{T_y})$, where $x_i$ and $y_i$ are the feature vectors. As noted, in general the sequence of $x_i$ will not have the same length as the sequence of $y_i$. In order to determine the distance between X and Y, given that some distance function $d(x, y)$ exists, we need a meaningful way to align the vectors for the comparison. DTW is one way in which such an alignment can be made. We define two warping functions, $\phi_x$ and $\phi_y$, which transform the indices of the vector sequences to a normalized time axis, $k$. Thus we have

$$i_x = \phi_x(k), \qquad i_y = \phi_y(k), \qquad k = 1, 2, \ldots, T.$$

This gives us a mapping from $(x_1, x_2, \ldots, x_{T_x})$ to $(x_1, x_2, \ldots, x_T)$ and from $(y_1, y_2, \ldots, y_{T_y})$ to $(y_1, y_2, \ldots, y_T)$. With such a mapping, we are able to compute $d_\phi(X, Y)$ using these warping functions, giving us the total distance between the two patterns as

$$d_\phi(X, Y) = \sum_{k=1}^{T} \frac{d\big(x_{\phi_x(k)},\, y_{\phi_y(k)}\big)\, m(k)}{M_\phi},$$

where $m(k)$ is a path weight and $M_\phi$ is a normalization factor.
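As a minimal sketch of the alignment just described, the following Python code computes a normalized DTW distance between two feature sequences by taking the minimum over all admissible paths. Several choices here are assumptions rather than details from this work: the local distance d is Euclidean, the path weights m(k) follow a common symmetric scheme (1 for a step along one sequence, 2 for a diagonal step) so that the normalization factor equals T_x + T_y, and the function name `dtw_distance` is hypothetical.

```python
import numpy as np


def dtw_distance(X, Y):
    """Return (normalized DTW distance, warping path) for two feature sequences.

    X has shape (Tx, dim) and Y has shape (Ty, dim); 1-D inputs are treated
    as sequences of scalar features.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    if Y.ndim == 1:
        Y = Y[:, None]
    Tx, Ty = len(X), len(Y)

    # Local distance d(x_i, y_j): Euclidean (an assumption of this sketch).
    local = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)

    # D[i, j] = minimum weighted cost of aligning x_1..x_i with y_1..y_j.
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    step = np.zeros((Tx + 1, Ty + 1), dtype=int)
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            d = local[i - 1, j - 1]
            candidates = (
                D[i - 1, j] + d,          # 0: advance in X only, m(k) = 1
                D[i, j - 1] + d,          # 1: advance in Y only, m(k) = 1
                D[i - 1, j - 1] + 2 * d,  # 2: advance in both,   m(k) = 2
            )
            step[i, j] = int(np.argmin(candidates))
            D[i, j] = candidates[step[i, j]]

    # Backtrack from (Tx, Ty) to recover the optimal path as 0-based index pairs.
    path = []
    i, j = Tx, Ty
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        if step[i, j] == 0:
            i -= 1
        elif step[i, j] == 1:
            j -= 1
        else:
            i -= 1
            j -= 1
    path.reverse()

    # With these weights the normalization factor is Tx + Ty.
    return D[Tx, Ty] / (Tx + Ty), path
```

For example, `dtw_distance([0, 1, 2, 3], [0, 1, 1, 2, 3])` gives a distance of 0, because the second sequence only repeats one frame of the first. In practice X and Y would be per-frame feature vectors (such as LPC or MFCC) of the isolated phonemes.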
Thus, all that remains is the specification of the path $\phi$ indicated in the above equation. The most common technique is to choose $\phi$ as the path that minimizes the total distance over all possible paths, $d(X, Y) = \min_{\phi} d_\phi(X, Y)$, subject to certain constraints.

2) Objective Quality Measures: Objective speech quality measures are generally calculated from the normal speech and the disordered speech using a mathematical formula. They do not require human listeners, and so are less expensive and less time consuming. Objective measures are used to get a rough estimate of the quality.

SNR Measures: Signal-to-noise ratio (SNR) is one of the oldest and most widely used objective measures. It is mathematically simple to calculate, but requires both distorted and undistorted (clean) speech samples. SNR can be calculated as follows:

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{n=1}^{N} x^2(n)}{\sum_{n=1}^{N} \big(x(n) - y(n)\big)^2},$$

where $x(n)$ is the clean speech, $y(n)$ the distorted speech, and $N$ the number of samples.

LP-Based Measures: The speech production process can be modelled efficiently with a linear prediction (LP) model. Some of the following objective measures use the distance between two sets of linear prediction coefficients (LPC) calculated on the normal and the disordered speech.

1. The Itakura-Saito Distance Measure: The IS distortion measure is calculated based on the following equation:

$$d_{IS} = \frac{\sigma_x^2}{\sigma_y^2} \left( \frac{a_y R_x a_y^T}{a_x R_x a_x^T} \right) + \log\left( \frac{\sigma_y^2}{\sigma_x^2} \right) - 1,$$

where $\sigma_x^2$ and $\sigma_y^2$ represent the all-pole gains for the standard healthy speaker’s speech and the test patient’s speech, respectively. $a_x$ and $a_y$ are the healthy-speech and patient-speech LPC coefficient vectors, respectively. $R_x$ is the autocorrelation matrix for $k_x(n)$, where $k_x(n)$ is the sampled speech of the healthy speaker.

2. The Log-Likelihood Ratio: LLR is similar to the IS measure. While the IS measure incorporates the gain factor, LLR only considers the difference between the general spectral components. The following equation can be used for computing the LLR:

$$d_{LLR} = \log\left( \frac{a_y R_x a_y^T}{a_x R_x a_x^T} \right).$$

3. The Log-Area Ratio: LAR is a speech quality assessment measure based on the dissimilarity of LPC coefficients between the normal speech and the disordered speech.
LAR uses the reflection coefficients to calculate the difference and is expressed by the following equation:

$$d_{LAR} = \left| \frac{1}{p} \sum_{i=1}^{p} \left( \log\frac{1 + r_x(i)}{1 - r_x(i)} - \log\frac{1 + r_y(i)}{1 - r_y(i)} \right)^2 \right|^{1/2},$$

where $p$ is the order of the LPC coefficients, and $r_x(i)$ and $r_y(i)$ are the $i$th reflection coefficients of the healthy and the patient’s speech signals.

Log Cepstrum Distance: LCD is an estimate of the log-spectrum distance between normal and disordered speech. The cepstrum is calculated by taking the logarithm of the spectrum and converting back to the time domain. LCD can be calculated as follows:

$$d_{LCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{k=1}^{P} \big(c_x(k) - c_y(k)\big)^2},$$

where $c_x$ and $c_y$ are the cepstrum vectors for normal and disordered speech, and $P$ is the order.

C. Subjective Quality Measures

Subjective evaluation of speech sound disorder is required to validate the scores obtained by the objective measures. Subjective evaluation is done by taking the opinion of a set of listeners. The listeners are requested to mark a score of 0 or 1, corresponding to normal and disordered speech, for each sample played to them. The definition of a good speech sample is left to the listener to decide. The final score for each utterance is then obtained by taking the mean opinion score of all listeners. The final score simply says whether the speech is normal or disordered.

A. Speech Database

The normal and disordered phonemes corresponding to the identified disorder were selected using the waveform analysis, and the features were extracted. DTW was then applied to the corresponding phonemes to align them in the time domain. Figure 5 shows the optimal frame match path between the standard healthy speech and the disordered speech. Here the distance between the normal and disordered speech was measured at the identified phoneme boundaries, which helped to reduce the speaker dependency to some extent. The objective measures were evaluated using the Mel Frequency Cepstrum Coefficients (MFCC) and the LPC coefficients. Of these, the results obtained from the LPC coefficients showed good correlation with the subjective scores. The poor performance with MFCC features may be due to the mel-scaling process carried out during computation of MFCC.
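To make the SNR, LAR and LCD measures described in this section concrete, here is a short Python sketch. It uses commonly cited textbook forms of these measures, which is an assumption and may differ in scaling from the implementation used in this work. It also assumes the inputs have already been time-aligned and have equal length. The function names are hypothetical.

```python
import numpy as np


def snr_db(clean, distorted):
    """SNR in dB between a clean signal x(n) and a distorted signal y(n)."""
    x = np.asarray(clean, dtype=float)
    y = np.asarray(distorted, dtype=float)
    noise_energy = np.sum((x - y) ** 2)
    # Identical signals give zero noise energy and hence an infinite SNR.
    return float(10.0 * np.log10(np.sum(x ** 2) / noise_energy))


def log_area_ratio(r_x, r_y):
    """LAR distance between two sets of reflection coefficients (|r| < 1)."""
    r_x = np.asarray(r_x, dtype=float)
    r_y = np.asarray(r_y, dtype=float)
    area_x = np.log((1.0 + r_x) / (1.0 - r_x))
    area_y = np.log((1.0 + r_y) / (1.0 - r_y))
    # Root of the mean squared difference over the p coefficients.
    return float(np.sqrt(np.mean((area_x - area_y) ** 2)))


def log_cepstrum_distance(c_x, c_y):
    """LCD between two cepstrum vectors of the same order P."""
    c_x = np.asarray(c_x, dtype=float)
    c_y = np.asarray(c_y, dtype=float)
    return float(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum((c_x - c_y) ** 2)))
```

Each function returns 0 (or an infinite SNR) when the two inputs are identical and grows as the disordered speech departs from the reference. In a pipeline like the one described here, the inputs would be taken frame by frame along the DTW alignment path.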
LPC coefficients were extracted with an order of only six, in order to reduce the speaker dependency. One correctly prompted phoneme from a healthy speaker was used as the standard phoneme for calculating the objective measures of quality.

B. Performance Evaluation

The five distortion measures (IS, LLR, LAR, SNR and LCD) were calculated for each of the three identified disorders. Table 1 shows the classification of normal and disordered speech for the first speech disorder based on the five distance measures and the DTW distance. N1 to N11 denote speech samples from normal speakers and D1 to D11 denote disordered speech samples. The ideal classification is also given in the table. Irrespective of the order within N1 to N11 or within D1 to D11, what is required is that all the normal speech samples appear in the first eleven positions in Table 1, followed by the eleven disordered samples. None of the distance measures achieves this exactly, but the minimum error occurs with the DTW-based distance, where only one sample is misplaced and the false rejection rate (FRR) is therefore only 9.09%. IS and LCD exhibit poor performance with an FRR of 36.36%. The equation for evaluating FRR is given by

$$\mathrm{FRR} = \frac{FR}{TA + FR} \times 100,$$

where $TA$ is the number of phones annotated and recognized as normal and $FR$ is the number of phones recognized as disordered when the actual pronunciations are correct.

The same analysis for the other two speech disorders is shown in Tables 2 and 3. For the second disorder, all the objective measures give the same FRR. For the third disorder, DTW and SNR show poor performance, but all other measures provide good results with only one misplaced speech sample. Since the phoneme boundaries for /bha/ are correctly isolated in the waveform for the first disorder, the DTW distance gives the minimum FRR.
For the other two disorders this is not the case.

Table 4 lists the FRR of all the distance measures for the three speech disorders together with their average; the LLR distance measure gives the minimum average FRR.

Finally, all the objective distance measures are combined to obtain a score for each of the normal and disordered utterances. The combined score was obtained by a majority-rule voting method that classifies a speech signal as normal or disordered based on the classification given by the majority (that is, more than half) of the objective measures. Distance measures with low FRR were given more weight during this procedure. Table 5 gives the FRR obtained for the three disorder types with the combined method.

Automatic evaluation of speech disorder is not an easy task. In this work the disordered phoneme is isolated from the speech signal using waveform analysis. Spectral features are then derived from the disordered speech and matched with those of the corresponding normal phoneme using dynamic time warping (DTW) to align them in the time domain. To quantify the degree of disorder, objective speech quality measures such as Itakura-Saito (IS), log likelihood ratio (LLR), log area ratio (LAR), signal-to-noise ratio (SNR) and log cepstrum distance (LCD) were computed between normal and disordered phonemes. The objective scores were then compared with subjective evaluation scores to confirm the effectiveness of the objective measures. The combined objective score correlates better with the subjective score than the individual measures do.

A fully automated articulation test system for the Malayalam language can be developed in future by combining the proposed method with automatic speech recognition (ASR) systems. Instead of the phoneme boundaries being selected manually from the waveform, the disordered speech would be given as input to the ASR system, which would provide a semantically meaningful text output along with timestamps of the phonemes present in the input speech.
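The majority-vote combination and the FRR computation described under Performance Evaluation can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the weights in the example are made up, since the text says only that measures with low FRR were given more weight.

```python
def majority_vote(votes, weights=None):
    """Combine per-measure decisions into one label.

    votes maps a measure name to True (disordered) or False (normal).
    weights optionally maps the same names to positive weights; equal
    weights are assumed when none are given.
    Returns True when the weighted vote for "disordered" exceeds half.
    """
    if weights is None:
        weights = {name: 1.0 for name in votes}
    total = sum(weights[name] for name in votes)
    disordered = sum(weights[name] for name, vote in votes.items() if vote)
    return disordered > total / 2


def false_rejection_rate(predicted_disordered, actually_normal):
    """FRR in percent: FR / (TA + FR) * 100, counted over normal samples only."""
    pairs = list(zip(predicted_disordered, actually_normal))
    fr = sum(1 for pred, normal in pairs if normal and pred)      # falsely rejected
    ta = sum(1 for pred, normal in pairs if normal and not pred)  # truly accepted
    return 100.0 * fr / (ta + fr)


# Example: five measures vote on one utterance (hypothetical decisions).
votes = {"IS": True, "LLR": False, "LAR": False, "SNR": True, "LCD": False}
label = majority_vote(votes)  # two of five say "disordered", so the result is normal

# Example: eleven normal and eleven disordered samples, one normal sample rejected.
predicted = [False] * 10 + [True] + [True] * 11
is_normal = [True] * 11 + [False] * 11
frr = false_rejection_rate(predicted, is_normal)
```

With eleven normal samples of which one is flagged as disordered, `false_rejection_rate` returns about 9.09, which matches the DTW figure reported for the first disorder.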