The Effect of Prosodic Features on the Performance of Speaker Identification

Abstract (English):
This paper investigates the effect of prosodic features on the performance of a speaker identification system in noisy environments. For this purpose, prosodic features (formant frequencies, signal energy, and pitch frequency) and mel-frequency cepstral coefficients (MFCC) are extracted from the speech signal. The distribution of the features for each speaker is then modeled by a Gaussian mixture model (GMM). Speaker recognition is performed on the TIMIT and NTIMIT databases, and the noisy conditions are created using the NOISEX database. The experimental results show that the speaker identification error rate decreases when the first derivative of the energy and the ratio of the formant frequencies (F3/F2) are included in the feature vector. Pitch frequency is also found to be a feature that is robust to noise and to telephone-line distortion.
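The pipeline described in the abstract (MFCC frames extended with the first derivative of the energy and the F3/F2 ratio, then scored against one GMM per speaker) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the use of a simple first difference as the energy derivative, the diagonal covariances, and the assumption that each speaker's GMM parameters are already trained are all illustrative choices that the abstract does not specify.

```python
import math

def first_difference(values):
    """Approximate the first derivative as d[t] = e[t] - e[t-1], with d[0] = 0.
    (Assumption: the abstract does not say how the derivative is computed.)"""
    return [0.0] + [b - a for a, b in zip(values, values[1:])]

def build_features(mfcc_frames, log_energy, f2, f3):
    """Append the energy derivative and the F3/F2 ratio to each MFCC frame."""
    d_energy = first_difference(log_energy)
    return [list(c) + [de, hi / lo]
            for c, de, lo, hi in zip(mfcc_frames, d_energy, f2, f3)]

def gmm_log_likelihood(x, gmm):
    """log p(x) for a diagonal-covariance GMM given as (weight, mean, variance) triples."""
    logs = []
    for w, mean, var in gmm:
        ll = math.log(w)
        for xi, m, v in zip(x, mean, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(ll)
    peak = max(logs)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(l - peak) for l in logs))

def identify(frames, speaker_models):
    """Return the speaker whose GMM gives the highest average frame log-likelihood."""
    def score(gmm):
        return sum(gmm_log_likelihood(x, gmm) for x in frames) / len(frames)
    return max(speaker_models, key=lambda name: score(speaker_models[name]))
```

In practice the GMM parameters would be estimated per speaker from training data (for example with expectation-maximization). Here they are supplied directly as a dictionary mapping a speaker label to a list of mixture components.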



