user2288429

Reputation: 57

Python NLP spaCy: improve bi-gram extraction from a dataframe, and with named entities?

I am using Python and spaCy as my NLP library, working on a big dataframe that contains feedback about different cars, which looks like this:

[screenshot of the dataframe: one row per piece of feedback, including a 'lemmatized' column with the preprocessed text and an 'entities' column with the extracted named entities]

I then created the following function, which runs the spaCy pipeline (nlp_token) over each row of my dataframe and extracts any [noun + verb], [verb + noun], [adj + noun], or [adj + proper noun] combination.

def bi_gram(x):
    doc = nlp_token(x)
    result = []
    text = ''
    for i in range(len(doc)):
        j = i+1
        if j < len(doc):
            if (doc[i].pos_ == "NOUN" and doc[j].pos_ == "VERB") or (doc[i].pos_ == "VERB" and doc[j].pos_ == "NOUN") or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "NOUN") or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "PROPN"):
                text = doc[i].text + " " + doc[j].text
                result.append(text)
        i = i+1
        return result

Then I applied this function to the 'lemmatized' column:

df['bi_gram'] = df['lemmatized'].apply(bi_gram)

This is where I have a problem...

  1. This produces at most one bigram per row. How can I tweak the code so that more than one bigram can be extracted and put in a column? (Also, are there more linguistic combinations I should try?)

  2. Is there a way to find out what people are saying about the 'CAR_BRAND' and 'CAR_MODEL' named entities extracted in the 'entities' column? For example 'Cool Porsche'. Some brands or models consist of more than two words, so this is tricky to tackle.

I am very new to NLP. If there is a more efficient way to tackle this, any advice would be super helpful! Many thanks for your help in advance.

Upvotes: 1

Views: 956

Answers (1)

fsimonjetz

Reputation: 5802

spaCy has a built-in pattern matching engine that's perfect for your application; it's documented in the Matcher API reference and, in more depth, in the rule-based matching usage guide. It lets you define patterns in a readable and easy-to-maintain way, as lists of dictionaries describing the properties of the tokens to be matched.

Set up the pattern matcher

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm") # or whatever model you choose

matcher = Matcher(nlp.vocab)

# your patterns
patterns = {
    "noun_verb": [{"POS": "NOUN"}, {"POS": "VERB"}],
    "verb_noun": [{"POS": "VERB"}, {"POS": "NOUN"}],
    "adj_noun": [{"POS": "ADJ"}, {"POS": "NOUN"}],
    "adj_propn": [{"POS": "ADJ"}, {"POS": "PROPN"}],
}

# add the patterns to the matcher
for pattern_name, pattern in patterns.items():
    matcher.add(pattern_name, [pattern])
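
If you want to experiment with more linguistic combinations, note that each token dictionary can also carry an operator. As a small illustrative addition (the pattern name is my own), here is one that allows an optional adverb in front of the [adj + noun] pair:

# optional adverb + adjective + noun, so both "fast car" and "very fast car" match
adv_adj_noun = [
    {"POS": "ADV", "OP": "?"},
    {"POS": "ADJ"},
    {"POS": "NOUN"},
]
matcher.add("adv_adj_noun", [adv_adj_noun])

With "OP": "?" the matcher returns both the two- and three-token variants; spacy.util.filter_spans can reduce overlapping matches to the longest spans if you only want one.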

Extract matches

doc = nlp("The dog chased cats. Fast cats usually escape dogs.")
matches = matcher(doc)

matches is a list of tuples, each containing

  • a match id,
  • the start index of the matched span, and
  • the end index (exclusive).

This is a test output adapted from the spaCy usage guide:

for match_id, start, end in matches:
    
    # Get string representation
    string_id = nlp.vocab.strings[match_id]

    # The matched span
    span = doc[start:end]
    
    print(repr(span.text))
    print(match_id, string_id, start, end)
    print()

Result

'dog chased'
1211260348777212867 noun_verb 1 3

'chased cats'
8748318984383740835 verb_noun 2 4

'Fast cats'
2526562708749592420 adj_noun 5 7

'escape dogs'
8748318984383740835 verb_noun 8 10
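
To answer your first question directly: matcher(doc) returns every match in the document, so you can wrap the lookup in a small function and apply it to your column. A minimal sketch, reusing the df and 'lemmatized' column from your question:

def extract_bigrams(text):
    doc = nlp(text)
    # one list entry per match, so every bigram in the row is kept
    return [doc[start:end].text for _, start, end in matcher(doc)]

df['bi_gram'] = df['lemmatized'].apply(extract_bigrams)

For a large dataframe, running the texts through nlp.pipe(df['lemmatized']) in one go is considerably faster than calling nlp row by row.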

Some ideas for improvement

  • Named entity recognition should be able to detect multi-word expressions, so brand and/or model names that consist of more than one token shouldn't be an issue if everything is set up correctly.
  • Matching dependency patterns instead of linear patterns might slightly improve your results (see the sketch after this list).
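
Here is a rough sketch combining both ideas: a DependencyMatcher pattern that finds adjectives modifying a noun or proper noun, plus a lookup in doc.ents to recover the full multi-token entity. It assumes your pipeline actually tags the brands and models as entities (e.g. with your custom CAR_BRAND/CAR_MODEL labels); the example sentence is mine and not guaranteed to parse this way with the small stock model:

from spacy.matcher import DependencyMatcher

dep_matcher = DependencyMatcher(nlp.vocab)

# an adjective ("amod") attached to a noun or proper noun, at any distance
pattern = [
    {"RIGHT_ID": "target", "RIGHT_ATTRS": {"POS": {"IN": ["NOUN", "PROPN"]}}},
    {"LEFT_ID": "target", "REL_OP": ">", "RIGHT_ID": "modifier", "RIGHT_ATTRS": {"DEP": "amod"}},
]
dep_matcher.add("adj_modifier", [pattern])

doc = nlp("What a cool old Porsche 911 Carrera!")
for match_id, (target_i, modifier_i) in dep_matcher(doc):
    target, modifier = doc[target_i], doc[modifier_i]
    # if the target token sits inside a named entity, report the whole entity span
    entity = next((ent for ent in doc.ents if target in ent), None)
    print(modifier.text, "->", entity.text if entity is not None else target.text)

The token ids come back in the order of the pattern dictionaries, and the DependencyMatcher needs the parser, so make sure it's enabled in the pipeline you load.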

That being said, what you're trying to do, a kind of sentiment analysis, is quite a difficult task that's normally tackled with machine learning approaches and heaps of training data. So don't expect too much from simple heuristics.

Upvotes: 3
