shivam negi

Reputation: 3

Extracting sentence from a dataframe with description column based on a phrase

I have a dataframe with a 'description' column containing details about the product. Each description in the column is a long paragraph, like:

"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"

How do I locate/extract the sentence which has the phrase "superb product", and place it in a new column?

So for this case, the expected output is the sentence(s) containing "superb product".

I have used this,

searched_words=['superb product','SUPERB PRODUCT']


print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                           if any(True for w in word_tokenize(sent) 
                                     if stemmer.stem(w.lower()) in searched_words)]))

The output is not what I want, though it works if I put just a single word in the searched_words list.

Upvotes: 0

Views: 1423

Answers (2)

Saginus

Reputation: 159

There are a lot of ways to do this. @ChootsMagoots gave you a good answer, but spaCy is also efficient: you can define a match pattern that leads you to the sentence. Before that, you need to define a function that sets the sentence boundaries. Here's the code:


import spacy
from spacy.language import Language
from spacy.matcher import Matcher

# Written for spaCy v3, where custom components are registered by name
@Language.component("product_sentencizer")
def product_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-1]):  # The last token has no successor to mark
        if token.text == ".":
            doc[i+1].is_sent_start = True
        else:
            doc[i+1].is_sent_start = False  # Tell the parser not to start a sentence here
    return doc

nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.add_pipe("product_sentencizer", before="parser")  # Insert before the parser can build its own sentences
text = "This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much."
doc = nlp(text)

matcher = Matcher(nlp.vocab)
# One dict per token: the phrase is two tokens, matched case-insensitively
pattern = [{'LOWER': 'superb'}, {'LOWER': 'product'}]
matcher.add("SUPERB_PRODUCT", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)       # The matched phrase
    print(matched_span.sent.text)  # The sentence containing it

Upvotes: 1

ChootsMagoots

Reputation: 670

Assuming the paragraphs are neatly formatted into sentences with ending periods, something like:

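for index, paragraph in df['column_name'].items():  # iteritems() was removed in pandas 2.0
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            # .loc avoids chained assignment; a later match in the same paragraph overwrites an earlier one
            df.loc[index, 'extracted_sentence'] = sentence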

This is going to be quite slow, but idk if there's a better way.
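One possible alternative to the explicit loop, as a minimal sketch only: put the sentence filtering in a small helper and apply it per description. The helper name `sentences_with_phrase` and the regex-based splitting are my own assumptions, not something from the question. It splits naively after '.', '!' or '?', so abbreviations like "e.g." will break it, and it returns every matching sentence instead of just one.

```python
import re

def sentences_with_phrase(text, phrase):
    """Return the sentences of `text` that contain `phrase`, ignoring case.

    Hypothetical helper: sentence splitting is a naive regex, not a real tokenizer.
    """
    # Split wherever '.', '!' or '?' is followed by whitespace, keeping the punctuation
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    needle = phrase.lower()
    return [s for s in sentences if needle in s.lower()]

description = ("This is a superb product. I so so loved this superb product "
               "that I wanna gift to all. This is like the quality and packaging. "
               "I like it very much")

matches = sentences_with_phrase(description, "superb product")
print(matches)
# ['This is a superb product.', 'I so so loved this superb product that I wanna gift to all.']
```

On the dataframe this could then be used as df['extracted_sentence'] = df['description'].apply(lambda t: sentences_with_phrase(t, 'superb product')), which gives a list of matching sentences per row. I haven't timed it against the loop, so treat the speed benefit as a guess.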

Upvotes: 0
