Reputation: 63
I have a bunch of user queries. Some of them contain junk characters, e.g. "I work in Google asdasb asnlkasn".
I need only "I work in Google".
import nltk
import spacy
import truecase

words = set(nltk.corpus.words.words())
nlp = spacy.load('en_core_web_lg')

def check_ner(word):
    doc = nlp(word)
    ner_list = []
    for ent in doc.ents:
        ner_list.append(ent.text)
    return ner_list

sent = "I work in google asdasb asnlkasn"
sent = truecase.get_true_case(sent)
ner_list = check_ner(sent)
final_sent = " ".join(w for w in nltk.wordpunct_tokenize(sent)
                      if w.lower() in words or not w.isalpha() or w in ner_list)
I tried this, but it doesn't remove the junk, because NER detects "Google asdasb asnlkasn" as Work_of_Art, or sometimes "asdasb asnlkasn" as Person.
I had to include NER because words = set(nltk.corpus.words.words()) doesn't contain Google, Microsoft, Apple, etc., or any other named-entity values.
Upvotes: 1
Views: 6377
Reputation: 469
You can use this to identify your non-words:
words = set(nltk.corpus.words.words())
sent = "I work in google asdasb asnlkasn"
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
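To see what that filter keeps and drops, here is the same idea with a small stand-in vocabulary (a made-up set used instead of the full nltk corpus, so the example runs without downloads):

```python
# Stand-in for set(nltk.corpus.words.words()); the real corpus is much larger
words = {"i", "work", "in", "the", "office"}

def keep(w):
    # Keep dictionary words and any non-alphabetic token (numbers, punctuation)
    return w.lower() in words or not w.isalpha()

sent = "I work in google asdasb asnlkasn"
cleaned = " ".join(w for w in sent.split() if keep(w))
print(cleaned)  # "I work in" — note "google" is dropped too, hence the need for NER
```

This shows exactly why the question needs NER on top: a plain dictionary filter throws away "google" along with the junk.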
Try using this. Thanks to @DYZ for that answer.
However, since you said you need NER for Google, Apple, etc., and that is causing incorrect recognition, you can compute scores for the NER predictions using a beam parse. Then set a threshold of acceptable values and drop the entities that fall below it. These meaningless words should get a low probabilistic score for a category such as Person, and you can also drop categories such as Work_of_Art altogether if you don't need them.
An example of using beam parse for scoring:
import spacy
from collections import defaultdict

nlp = spacy.load('en_core_web_lg')
text = u'I work in Google asdasb asnlkasn'

# Run the pipeline without the regular NER pass so we can beam-parse it ourselves
with nlp.disable_pipes('ner'):
    doc = nlp(text)

threshold = 0.2
beams = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)

# Accumulate a score per (start, end, label) span across all beam parses
entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

print('Entities and scores (detected with beam search)')
for (start, end, label), score in entity_scores.items():
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))
It worked in my testing, and with the threshold in place NER no longer recognized the junk words as entities.
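Putting the two pieces together, the final filter can keep dictionary words plus only those tokens covered by a high-scoring entity. A minimal sketch, where the vocabulary and the entity_scores values are invented for illustration (in practice they come from the nltk corpus and the beam parse above):

```python
words = {"i", "work", "in"}  # stand-in for set(nltk.corpus.words.words())
tokens = ["I", "work", "in", "Google", "asdasb", "asnlkasn"]

# (start, end, label) -> score, as produced by the beam parse;
# these numbers are made up for the example
entity_scores = {
    (3, 4, "ORG"): 0.85,     # "Google"
    (4, 6, "PERSON"): 0.04,  # "asdasb asnlkasn" — low confidence, below threshold
}

threshold = 0.2
# Token indices covered by an entity whose score clears the threshold
entity_tokens = {
    i
    for (start, end, label), score in entity_scores.items()
    if score > threshold
    for i in range(start, end)
}

cleaned = " ".join(
    w for i, w in enumerate(tokens)
    if w.lower() in words or not w.isalpha() or i in entity_tokens
)
print(cleaned)  # "I work in Google"
```

The low-scoring Person span is ignored, so the junk tokens fail every test and are dropped, while "Google" survives through its high-scoring ORG span.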
Upvotes: 1