Visual speech recognition

Several experiments were conducted in order to make visual speech recognition.  Images from the M2VTS database were used in order to make lip location normalization.   This study presents a method which uses visual information from the lips in motion, and is able to normalize the lip location over all lip images. This method is based on a search algorithm of the lip position in which a lip location normalization is integrated in the model training of Hidden Markov Models, HMMs.

In general terms, this method can be described as the following iterative procedure:

  1. Best Location Search : For a training data set, find the best lip location data set in the sense that the likelihood of each word in the data set is the highest for the corresponding HMMs In this step, for each utterance, an algorithm in order to search around the current position for the best location is carried out.  The lip location corresponding of an utterance is changed several times and the likelihood for the corresponding HMMs is calculated.  The lip position with the high likelihood is selected as the best position.

  2. Model Update:  Update the HMM models by the Baum Welch re-estimation algorithm using all training data set having the best location.  In this step, a new set of HMM models is trained, using the last set of HMM models and using all the utterances with the best lip position.   

All the experiments carried out in this study, use the M2VTS database, which is based on the image based method with 1850 utterances and more than 28000 images.  The experiments showed that the proposed method is able to normalize the lip positions effectively and also show that even a prior normalization method, the proposed one is able to normalize much more the lip positions.

Figure 1 shows some examples of some images before and after applying this method.


  Figure 1.  Images before and after applying the proposed method.


As can be seen,  the lip position after applying this method are centered.  Considering that original images presented not centered lip positions, the visual recognition rates could be improved notoriously.   Figure 2, shows the recognition rates given after several iterations were applied.    A recognition rate of 67.7% was obtained for "w N" (no normalization was applied),   71.2% was obtained for "w I" (intensity normalization was only applied).  After applying the lip location normalized process, the recognition rate was increasing up to 76.6% on iteration 3.  


Figure 2.  Recognition results given when the proposed method was applied.


Figure 3 shows the HMMs models before and after this method was applied.  Here we can appreciate that Lip images given by the HMM models are sharper than those without applying this method.



 Figure 3.  Images given by the HMMs states before and after the proposed method was applied.




This site was last updated 07/06/14