Getting wrong answers from langdetect.detect

I am using both NLTK and scikit-learn to do some text processing. I have a data set of sentences, some of which describe the situation in both French and English (the French part duplicates the English), and I want to delete the French part. The following is one of my sentences:

"quipage de Global Express en provenance deTokyo Japon vers Dorval a d effectuer une remise des gaz sur la piste cause d un probl me de volets Il fut autoris se poser sur la piste Les services d urgence n ont pas t demand s appareil s est pos sans encombre D lai d environ minutes sur l exploitation The crew of Global Express from Tokyo Japan to Dorval had to pull up on Rwy at because of a flap problem It was cleared to land on Rwy Emergency services were not requested The aircraft touched down without incident Delay of about minutes to operations Regional Report of m d y with record s "

I want to remove all words that are in French. I have tried the following code so far, but the result is not good enough.

import langdetect

x = sentence.split()
# Iterate over a copy: removing from a list while looping over it skips items.
for word in list(x):
    lang = langdetect.detect(word)
    if lang == 'fr':
        print(word)
        x.remove(word)

The following is my output:

l
un
sur
une
oiseaux
avoir
un
le
du
un
est

Is this a good approach? How can I improve its performance to get better results?

Upvotes: 0

Views: 2067

Answers (1)

aab

Reputation: 11484

Language detection usually needs at least a full sentence to do a decent job. One or two short words are probably not going to be enough. Think about the "a" in "Dorval a d effectuer" above: is "a" by itself French or English? Is "Tokyo" French?
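One way to act on this is to classify groups of words instead of single words. Here is a rough sketch, not a tested solution for your data: the function name `drop_french_chunks` and the chunk size of 8 are my own choices, and the detector is passed in as a parameter (with langdetect installed you could pass `langdetect.detect`).

```python
from typing import Callable, List


def drop_french_chunks(
    text: str,
    detect: Callable[[str], str],
    chunk_size: int = 8,
) -> str:
    """Split text into fixed-size word chunks and keep the non-French ones.

    `detect` is any function that maps a string to a language code such
    as 'fr' or 'en'. `chunk_size` is an arbitrary starting point.
    """
    words = text.split()
    kept: List[str] = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        # Classify the whole chunk so the detector sees more context
        # than a single word would give it.
        if detect(" ".join(chunk)) != "fr":
            kept.extend(chunk)
    return " ".join(kept)
```

A chunk that straddles the French/English boundary will still be kept or dropped as a whole, so you would probably need to experiment with the chunk size.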

I'd also double-check whether this library can handle the kind of non-standard French (no accents, no apostrophes, missing letters, etc.) that you have in your data by checking to see what the library detects for longer strings. It's possible the library is only good at figuring out that more standard French is French. For example, compare "d'un problème" with your data: "d un probl me".
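A quick check along these lines might look like the sketch below. The helper names `detect_all` and `langdetect_detect` are hypothetical. The first sample is my own guess at what the clean French would be; the second is copied from your sentence. I have not run this against your data, so I can't say what it will print.

```python
from typing import Callable, Dict, Iterable


def detect_all(samples: Iterable[str], detect: Callable[[str], str]) -> Dict[str, str]:
    """Map each sample string to the language code the detector returns."""
    return {sample: detect(sample) for sample in samples}


def langdetect_detect(text: str) -> str:
    """Thin wrapper around langdetect.detect (needs langdetect installed)."""
    # Imported here so detect_all can be used without the library.
    from langdetect import DetectorFactory, detect

    # langdetect is non-deterministic unless the seed is fixed.
    DetectorFactory.seed = 0
    return detect(text)


SAMPLES = [
    # Assumed "standard" spelling, reconstructed by hand for comparison.
    "L'équipage a dû effectuer une remise des gaz à cause d'un problème de volets",
    # The same phrase as it appears in the question's data.
    "a d effectuer une remise des gaz sur la piste cause d un probl me de volets",
]

# With langdetect installed:
# for text, lang in detect_all(SAMPLES, langdetect_detect).items():
#     print(lang, "<-", text)
```

If the second sample is not detected as French even at this length, word-level detection on the same data is unlikely to work any better.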

See also this question for other approaches where you can restrict the possible set of languages: "Python langdetect: choose between one language or the other only".
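I won't repeat the answers from that question here, but one possible way to restrict the choice is to ask langdetect for its probabilities with `detect_langs` and only compare the languages you care about. This is a sketch under that assumption: `pick_language` and `langdetect_probs` are names I made up, and the probabilities are passed in through a function so the selection logic does not depend on the library.

```python
from typing import Callable, Dict, Optional, Sequence


def pick_language(
    text: str,
    candidates: Sequence[str],
    lang_probs: Callable[[str], Dict[str, float]],
) -> Optional[str]:
    """Return the most probable language among `candidates`.

    Returns None when the detector reports none of the candidates.
    """
    probs = lang_probs(text)
    # Languages the detector did not report count as probability 0.
    scored = [(probs.get(lang, 0.0), lang) for lang in candidates]
    best_prob, best_lang = max(scored)
    return best_lang if best_prob > 0.0 else None


def langdetect_probs(text: str) -> Dict[str, float]:
    """Adapter around langdetect.detect_langs (needs langdetect installed)."""
    from langdetect import DetectorFactory, detect_langs

    # Fix the seed so repeated runs give the same probabilities.
    DetectorFactory.seed = 0
    return {guess.lang: guess.prob for guess in detect_langs(text)}


# With langdetect installed:
# pick_language("sur la piste", ("fr", "en"), langdetect_probs)
```

Note that `detect_langs` may leave low-probability languages out of its result, so for very short input both of your candidates can be missing and you get None back.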

Upvotes: 1
