JamesArthur

Reputation: 506

Problem analyzing a doc column in a df with spaCy nlp

After using an Amazon review scraper to build this data frame, I called nlp on each review to tokenize it and created a new column containing the processed reviews as Docs.

However, now I am trying to create a pattern to analyze the reviews in the doc column, but I keep getting no matches, which makes me think I'm missing one more pre-processing step, or perhaps not pointing the matcher in the right direction.

While the following code executes without any errors, I receive a matches list with 0 entries, even though I know the word exists in the doc column. The docs for spaCy are still a tad slim, and I'm not sure my matcher.add call is correct, as the form specified in the tutorial,

matcher.add("Name_of_List", None, pattern)

returns an error saying that the method takes only 2 arguments.
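From what I can tell, the Matcher.add signature changed between spaCy v2 and v3, which would explain the error (a minimal comparison of the two forms, assuming those versions):

# spaCy v2.x: the second positional argument is an optional on_match callback
matcher.add("QUALITY_PATTERN", None, pattern)

# spaCy v3.x: patterns are passed as a list; on_match is keyword-only
matcher.add("QUALITY_PATTERN", [pattern])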

Question: What do I need to change to accurately analyze the df doc column for the pattern created?

Thanks!

Full code:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_md')


df = pd.read_csv('paper_towel_US.csv')

# call nlp to return a processed Doc for each review
df['doc'] = [nlp(body) for body in df.body]


# Count the number of tokens in each Doc
df['num_tokens'] = [len(doc) for doc in df['doc']]



# create the Matcher with the model's shared vocab
matcher = Matcher(nlp.vocab)


pattern =[{"LEMMA": "love"},
          {"OP":"+"}
          
          ]
matcher.add("QUALITY_PATTERN", [pattern])


def find_matches(doc):
    # build a Span for every raw match, then drop overlapping spans
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return (span.start, span.end, span.text)
    
    
df['doc'].apply(find_matches)
    

df sample for reproduction, generated via df.iloc[596:600, :].to_clipboard(sep=','):

,product,title,rating,body,doc,num_tokens
596,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Awesome!,5,Great towels!,Great towels!,3
597,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Good buy!,5,Love these,Love these,2
598,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Meh,3,"Does not clean countertop messes well. Towels leave a large residue. They are durable, though","Does not clean countertop messes well. Towels leave a large residue. They are durable, though",18
599,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Exactly as Described. Packaged Well and Mailed Promptly,4,Exactly as Described. Packaged Well and Mailed Promptly,Exactly as Described. Packaged Well and Mailed Promptly,9

Upvotes: 1

Views: 515

Answers (1)

Wiktor Stribiżew

Reputation: 626893

You are trying to get the matches from the literal string "df.doc" with doc = nlp("df.doc"). You need to extract matches from the Doc objects in the df['doc'] column instead.

An example solution is to remove doc = nlp("df.doc") and apply the matcher to each Doc in df['doc'] (here with nlp = spacy.load('en_core_web_sm')):

def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return (span.start, span.end, span.text)

>>> df['doc'].apply(find_matches)
0                  None
1    (0, 2, Love these)
2                  None
3                  None
Name: doc, dtype: object
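
Note that find_matches returns inside the loop, so at most the first non-overlapping match per review is reported. If you want all matches, a minimal variant could collect them into a list (find_all_matches is just an illustrative name):

def find_all_matches(doc):
    # keep every non-overlapping match instead of stopping at the first one
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    return [(span.start, span.end, span.text) for span in spacy.util.filter_spans(spans)]

df['matches'] = df['doc'].apply(find_all_matches)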

Full code snippet:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')


df = pd.read_csv(r'C:\Users\admin\Desktop\s.txt')

# call nlp to return a processed Doc for each review
df['doc'] = [nlp(body) for body in df.body]


# Count the number of tokens in each Doc
df['num_tokens'] = [len(doc) for doc in df['doc']]

# create the Matcher with the model's shared vocab
matcher = Matcher(nlp.vocab)


# match a token whose lemma is "love", followed by one or more tokens of any kind
pattern = [{"LEMMA": "love"},
           {"OP": "+"}]
matcher.add("QUALITY_PATTERN", [pattern])

#doc = nlp("df.doc")

#matches = matcher(doc)
def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return (span.start, span.end, span.text)

print(df['doc'].apply(find_matches))
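
As a quick sanity check (a made-up sentence, not from the original data), the LEMMA attribute means inflected forms of "love" also match:

doc = nlp("I loved these towels")
print(find_matches(doc))  # e.g. (1, 4, 'loved these towels')

Because {"OP": "+"} produces overlapping candidate spans, spacy.util.filter_spans keeps only the longest one.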

Upvotes: 1
