Chunking for non-noun phrases in SpaCy

Question

Sorry if this seems like a silly question, but I am still new to Python and SpaCy.

I have a data frame that contains customer complaints. It looks a bit like this:

import pandas as pd

df = pd.DataFrame(
    [[1, 'I was waiting at the bus stop and then suddenly the car mounted the pavement'],
     [2, 'When we got on the bus, we went upstairs but the bus braked hard and I fell'],
     [3, 'The bus was clearly in the wrong lane when it crashed into my car']],
    columns=['ID', 'Text'])

If I want to obtain the noun phrases, then I can do this:

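import spacy

nlp = spacy.load('en_core_web_sm')  # assumes this English model is installed

def extract_noun_phrases(text):
    # Return each noun chunk's text together with its span label
    return [(chunk.text, chunk.label_) for chunk in nlp(text).noun_chunks]

def add_noun_phrases(df):
    # Adds a 'noun_phrases' column to df in place
    df['noun_phrases'] = df['Text'].apply(extract_noun_phrases)

add_noun_phrases(df)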

What if I want to extract prepositional phrases from the df instead? Specifically, I am trying to extract phrases like:

at the bus stop
in the wrong lane

I know I am meant to be using subtree for this, but I don't understand how to apply it to my dataset.

Eric McLachlan · Accepted Answer

A prepositional phrase is simply a preposition followed by a noun phrase.

Since you already know how to identify noun phrases using noun_chunks, it may be as simple as checking the token immediately before each noun phrase. If preceding_token.pos_ is 'ADP' (ADP means adposition, and a preposition is a type of adposition), then you have probably found a prepositional phrase.
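
A minimal sketch of that check, assuming doc is a parsed spaCy Doc, so that noun_chunks and pos_ are populated. The names prepositional_phrases and add_prepositional_phrases and the 'prep_phrases' column are illustrative; they are not part of spaCy.

```python
def prepositional_phrases(doc):
    """Return noun chunks that are directly preceded by an adposition.

    `doc` is assumed to be a parsed spaCy Doc.
    """
    phrases = []
    for chunk in doc.noun_chunks:
        if chunk.start == 0:
            continue  # nothing precedes a chunk at the very start of the text
        preceding_token = doc[chunk.start - 1]
        if preceding_token.pos_ == "ADP":
            # Span from the preposition through the end of the noun chunk
            phrases.append(doc[chunk.start - 1 : chunk.end].text)
    return phrases


def add_prepositional_phrases(df, nlp):
    """Add a hypothetical 'prep_phrases' column to df, in place."""
    df["prep_phrases"] = df["Text"].apply(
        lambda text: prepositional_phrases(nlp(text))
    )


# Usage, assuming spaCy and an English model such as en_core_web_sm are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   add_prepositional_phrases(df, nlp)
```

Note that the ADP tag covers more than prepositions, so depending on how the model tags a sentence, this check may also pick up a few phrases you do not want.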

Alternatively, instead of checking pos_, you could check whether preceding_token.dep_ is 'prep'. Which check is available depends on which components of the SpaCy pipeline you have enabled, but the results should be similar.
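
The dep_ check can also start from the preposition rather than from the noun chunk, which is where the subtree mentioned in the question comes in. The sketch below is a variation on the approach above, not the exact method described: it assumes the pipeline includes the dependency parser and that the model labels prepositions with the 'prep' dependency. The name prepositional_phrases_by_dep is illustrative.

```python
def prepositional_phrases_by_dep(doc):
    """Return the text spanned by each token whose dependency label is 'prep'.

    `doc` is assumed to be a spaCy Doc produced by a pipeline with a parser.
    """
    phrases = []
    for token in doc:
        if token.dep_ == "prep":
            # left_edge and right_edge are the leftmost and rightmost tokens
            # of token.subtree, so this slice covers the subtree's extent.
            span = doc[token.left_edge.i : token.right_edge.i + 1]
            phrases.append(span.text)
    return phrases


# Usage, assuming `nlp` is a loaded spaCy pipeline with a parser:
#   df['prep_phrases'] = df['Text'].apply(
#       lambda text: prepositional_phrases_by_dep(nlp(text)))
```

Because the slice covers the preposition's whole subtree, a nested phrase (for example "on the corner" in "at the stop on the corner", if the parser attaches it to "stop") would appear inside the outer phrase and again on its own.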
