Reputation: 63
I have a bunch of user queries. Some of them contain junk characters, e.g. "I work in Google asdasb asnlkasn".
I need only "I work in Google".
import nltk
import spacy
import truecase

words = set(nltk.corpus.words.words())
nlp = spacy.load('en_core_web_lg')

def check_ner(word):
    doc = nlp(word)
    ner_list = []
    for ent in doc.ents:
        ner_list.append(ent.text)
    return ner_list

sent = "I work in google asdasb asnlkasn"
sent = truecase.get_true_case(sent)
ner_list = check_ner(sent)
final_sent = " ".join(w for w in nltk.wordpunct_tokenize(sent)
                      if w.lower() in words or not w.isalpha() or w in ner_list)
I tried this, but it doesn't remove the junk, because NER detects "Google asdasb asnlkasn" as Work_of_Art, or sometimes "asdasb asnlkasn" as Person.
I had to include NER because words = set(nltk.corpus.words.words()) doesn't contain Google, Microsoft, Apple, etc., or any other named-entity values.
Upvotes: 1
Views: 6377
Reputation: 469
You can use this to identify your non-words:
words = set(nltk.corpus.words.words())
sent = "I work in google asdasb asnlkasn"
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
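To see what that filter keeps and drops, here is the same idea with a small stand-in vocabulary (a made-up set used instead of the full nltk corpus, so the example runs without downloads):

```python
# Stand-in for set(nltk.corpus.words.words()); the real corpus is much larger
words = {"i", "work", "in", "the", "office"}

def keep(w):
    # Keep dictionary words and any non-alphabetic token (numbers, punctuation)
    return w.lower() in words or not w.isalpha()

sent = "I work in google asdasb asnlkasn"
cleaned = " ".join(w for w in sent.split() if keep(w))
print(cleaned)  # "I work in" — note "google" is dropped too, hence the need for NER
```

This shows exactly why the question needs NER on top: a plain dictionary filter throws away "google" along with the junk.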
Try using this. Thanks to @DYZ for that answer.
However, since you said you need NER for Google, Apple, etc., and that is causing incorrect recognition, you can compute scores for the NER predictions using a beam parse. Then set a threshold of acceptable values and drop the entities that fall below it. These meaningless words should get a low probabilistic score for a category such as Person, and you can also drop categories such as Work_of_Art altogether if you don't need them.
An example of using beam parse for scoring:
import spacy
from collections import defaultdict

nlp = spacy.load('en_core_web_lg')
text = u'I work in Google asdasb asnlkasn'

# Run the pipeline without the regular NER pass so we can beam-parse it ourselves
with nlp.disable_pipes('ner'):
    doc = nlp(text)

threshold = 0.2
beams = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)

# Accumulate a score per (start, end, label) span across all beam parses
entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

print('Entities and scores (detected with beam search)')
for (start, end, label), score in entity_scores.items():
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))
It worked in my testing, and with the threshold in place NER no longer recognized the junk words as entities.
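Putting the two pieces together, the final filter can keep dictionary words plus only those tokens covered by a high-scoring entity. A minimal sketch, where the vocabulary and the entity_scores values are invented for illustration (in practice they come from the nltk corpus and the beam parse above):

```python
words = {"i", "work", "in"}  # stand-in for set(nltk.corpus.words.words())
tokens = ["I", "work", "in", "Google", "asdasb", "asnlkasn"]

# (start, end, label) -> score, as produced by the beam parse;
# these numbers are made up for the example
entity_scores = {
    (3, 4, "ORG"): 0.85,     # "Google"
    (4, 6, "PERSON"): 0.04,  # "asdasb asnlkasn" — low confidence, below threshold
}

threshold = 0.2
# Token indices covered by an entity whose score clears the threshold
entity_tokens = {
    i
    for (start, end, label), score in entity_scores.items()
    if score > threshold
    for i in range(start, end)
}

cleaned = " ".join(
    w for i, w in enumerate(tokens)
    if w.lower() in words or not w.isalpha() or i in entity_tokens
)
print(cleaned)  # "I work in Google"
```

The low-scoring Person span is ignored, so the junk tokens fail every test and are dropped, while "Google" survives through its high-scoring ORG span.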
Upvotes: 1