Oleg R

Reputation: 33

Extract two consecutive nouns using spaCy

Here is a simple dataset:

import pandas as pd
product = ['knife', 'box set', 'beautiful jewellery set on sale', 'green']
df = pd.DataFrame(product, columns = ['product_name'])
df

The output is as follows:

product_name
0 knife
1 box set
2 beautiful jewellery set on sale
3 green

I want to categorise these products by extracting two consecutive nouns where they occur, and a single noun otherwise. So far I have the following, but it returns only one noun in every case:

!pip install -q --upgrade spacy

import spacy
nlp = spacy.load('en_core_web_sm')

category=[]
for i in df['product_name'].tolist():
    doc = nlp(i)
    for t in doc:
      if t.pos_ in ['NOUN']:
        category.append(f'{t}')
        break
    if t.pos_ not in ['NOUN']:
      category.append('NaN')

df1 = pd.DataFrame(category, columns =['product_category'])
df1

The output I have:

product_category
0 knife
1 set
2 jewellery
3 NaN

The expected output:

product_category
0 knife
1 box set
2 jewellery set
3 NaN

Is it possible to introduce some additional conditions to the code to extract two nouns if they follow one after another?

Upvotes: 3

Views: 368

Answers (1)

Wiktor Stribiżew

Reputation: 626748

You can use

import spacy
import pandas as pd
import numpy as np

product = ['knife', 'box set', 'beautiful jewellery set on sale', 'green']
df = pd.DataFrame(product, columns = ['product_name'])

nlp = spacy.load('en_core_web_sm')
matcher = spacy.matcher.Matcher(nlp.vocab)
pattern = [{'POS': 'NOUN'},{'POS': 'NOUN','OP':'?'}]
matcher.add('NOUN_PATTERN', [pattern])

def get_two_nouns(x):
    doc = nlp(x)
    results = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        results.append(span.text)
    return max(results, key = lambda x: len(x.split()), default=np.nan)

df['product_name'].apply(get_two_nouns)

Output:

0            knife
1          box set
2    jewellery set
3              NaN
Name: product_name, dtype: object

The pattern [{'POS': 'NOUN'},{'POS': 'NOUN','OP':'?'}] matches a NOUN token optionally followed by a second NOUN token. The second token is optional because its OP operator is set to ?, so the matcher returns both single-noun and two-noun spans.
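
To make it concrete which candidate spans such a pattern produces, here is a minimal pure-Python imitation. It is only a sketch: noun_spans is a hypothetical helper, the POS tags are assigned by hand rather than produced by spaCy, and the order of the results is not claimed to match what spaCy's Matcher returns.

```python
def noun_spans(tagged):
    """Collect every span that is a NOUN optionally followed by a NOUN.

    `tagged` is a list of (word, pos) pairs. The tags here are hand-assigned
    assumptions, not the output of a spaCy model.
    """
    spans = []
    for i, (word, pos) in enumerate(tagged):
        if pos != "NOUN":
            continue
        # The optional second token is absent: a single-noun span.
        spans.append(word)
        # The optional second token is present: a two-noun span.
        if i + 1 < len(tagged) and tagged[i + 1][1] == "NOUN":
            spans.append(f"{word} {tagged[i + 1][0]}")
    return spans


tagged = [
    ("beautiful", "ADJ"),
    ("jewellery", "NOUN"),
    ("set", "NOUN"),
    ("on", "ADP"),
    ("sale", "NOUN"),
]
candidates = noun_spans(tagged)
print(candidates)
```

Because the single-noun spans are candidates too, a selection step is still needed to pick the longest one.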

The return max(results, key = lambda x: len(x.split()), default=np.nan) part returns the longest match, with length measured as the number of whitespace-separated tokens. If there are no matches at all, default=np.nan makes it return NaN instead of raising an error.
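
The selection step can be tried on its own, without spaCy. In this small sketch longest_by_words is a hypothetical name, and float("nan") stands in for np.nan so that the snippet needs only the standard library.

```python
import math


def longest_by_words(candidates):
    """Return the candidate with the most whitespace-separated words.

    Falls back to NaN when `candidates` is empty; float("nan") is used here
    as a stand-in for np.nan.
    """
    return max(candidates, key=lambda s: len(s.split()), default=float("nan"))


best = longest_by_words(["jewellery", "jewellery set", "set", "sale"])
empty = longest_by_words([])
tie = longest_by_words(["knife", "fork"])
print(best, empty, tie)
```

When several candidates have the same word count, max keeps the first one it encounters, so the result in that case depends on the order of the list.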

Upvotes: 2
