Ankit

Reputation: 21

PESQ, STOI scores (speech quality) for different languages (non-English)

I want to compute PESQ and STOI scores for some audio data in Hindi (a non-English language). For English, I can find PESQ implementations and use them as they are, for example: a StackOverflow question, the Python pesq package (PyPI), and STOI.

Can we use the same code for audio in Hindi or other languages to determine the PESQ/STOI scores?

Most of the time, I find PESQ described as a method for "evaluation of speech quality", not as something specific to English. Also, in "PESQ score comparison in different languages" and "PESQ score comparison in different languages_2", the PESQ scores for different languages are simply compared; no language-specific code is used.

But some papers suggest otherwise. The conclusion of "A Methodology for Improving PESQ accuracy for Chinese Speech" says: "In this paper, PESQ is being evaluated to investigate whether consonantal and tonal intelligibility of Chinese speech is being considered in its computation of speech quality. In the two experiments conducted, it was found out that correlation between subjective intelligibility and PESQ's computed quality was low in both noisy and quiet (noiseless) conditions".

Likewise, the conclusion of "Performance Evaluation and Accuracy Upgrading of PESQ in Chinese Environment" says: "Through the result from this large amount subjective test data, it is repeatedly pointed out that the scores from PESQ in Chinese are underestimated, although the Pearson Correlation Coefficient is as high as excepted. PESQ gives a much lower score than the experience from the customers when the voice service is in the middle quality."

So for other languages (in my case Hindi, an Indian language), should I use the standard PESQ method directly, or do I have to modify it? If it has to be modified, any ideas for Indian languages (or other non-English languages) would be very helpful.

Upvotes: 1

Views: 7924

Answers (1)

Jan

Reputation: 121

Not sure if you still need advice, but here are some comments on your questions:

  • STOI is a metric for predicting the intelligibility of (quite) noisy speech, not speech quality (which is typically evaluated in silence). The subjective tests underlying this method are intelligibility tests (asking listeners which words, syllables, logatoms, etc. they recognized). Even though the source code can be downloaded for free, free use is only allowed for research purposes, not for commercial work. The scope of this metric is rather limited; my suggestion would be not to use it at all.

  • PESQ (ITU-T P.862) is outdated and was superseded ~10 years ago, so don't use it anymore! Even the company that earns money by selling PESQ licenses does not recommend this method. BTW, as with STOI, legal use of PESQ is restricted, and even more strictly: the reference code may only be used for testing, e.g., platform-dependent implementations. For academic and commercial purposes, a license must be purchased! From the beginning, universities in particular took no notice of this.

  • Since the source code of PESQ is available on the ITU-T website, people have used it for numerous purposes it was not designed for (e.g. acoustic paths or noise reduction algorithms). All results you obtain with PESQ nowadays are unusable, because they do not reflect the state of the art currently used in industry.

  • The successor and state-of-the-art method for speech quality prediction is POLQA (ITU-T P.863). It was recently updated to version 3.0. The same licensing aspects as for PESQ apply; since many users abused the (quite relaxed) source code policy of PESQ, a reference implementation is no longer available, and you have to purchase a valid license.

  • Regarding language dependency: speech quality prediction metrics in general may include an inherent bias regarding languages (but also regarding the weighting of other possible degradations). Usually this originates from the training data available for such models, which exists only in certain languages, from certain labs, and containing certain degradations. Thus, shifts such as those observed in the work you mentioned are not unusual, especially for unknown/unseen degradations and languages. For standardized testing in particular, modifying the prediction algorithm in any way is strongly discouraged. The typical way of taking such shifts into account is to apply a mapping function for certain languages or degradations on top of the predicted MOS, i.e. to "transform" the standard model output to a "better" scale.
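
To make the last point concrete, here is a minimal sketch of such a mapping, assuming you have paired data for your language: objective scores from the unmodified model and subjective MOS from a listening test on the same conditions. The first-order (linear) form, the function names `fit_linear_mapping` and `apply_mapping`, and the numbers in the usage lines are all illustrative assumptions, not part of any standard or library; a real mapping would be fitted on your own listening-test data, and a monotonic third-order polynomial may fit better.

```python
from statistics import mean


def fit_linear_mapping(predicted, subjective):
    """Least-squares fit of subjective ~ a * predicted + b.

    predicted:  objective scores from the unmodified model, one per condition
    subjective: listening-test MOS for the same conditions
    """
    if len(predicted) != len(subjective) or len(predicted) < 2:
        raise ValueError("need at least two paired scores")
    mx, my = mean(predicted), mean(subjective)
    sxx = sum((x - mx) ** 2 for x in predicted)
    if sxx == 0:
        raise ValueError("predicted scores must not all be identical")
    sxy = sum((x - mx) * (y - my) for x, y in zip(predicted, subjective))
    a = sxy / sxx
    b = my - a * mx
    return a, b


def apply_mapping(score, a, b, lo=1.0, hi=5.0):
    """Map a raw predicted score and clip it to the MOS range [lo, hi]."""
    return min(hi, max(lo, a * score + b))


# Placeholder numbers for illustration only, not measured data.
a, b = fit_linear_mapping([1.5, 2.5, 3.5], [2.0, 3.0, 4.0])
mapped = apply_mapping(3.0, a, b)
```

The prediction algorithm itself stays untouched; only its output is rescaled, so the raw scores remain comparable with results from other labs.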

Upvotes: 5
