Reputation: 322

Python (NLTK) - more efficient way to extract noun phrases?

I've got a machine learning task involving a large amount of text data. I want to identify, and extract, noun-phrases in the training text so I can use them for feature construction later on in the pipeline. I've extracted the type of noun-phrases I wanted from text but I'm fairly new to NLTK, so I approached this problem in a way where I can break down each step in list comprehensions like you can see below.

But my real question is, am I reinventing the wheel here? Is there a faster way to do this that I'm not seeing?

import nltk
import pandas as pd

myData = pd.read_excel("\User\train_.xlsx")
texts = myData['message']

# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)

tokens = [nltk.word_tokenize(i) for i in texts]

tag_list = [nltk.pos_tag(w) for w in tokens]

phrases = [chunkr.parse(sublist) for sublist in tag_list]

leaves = [[subtree.leaves() for subtree in tree.subtrees(filter = lambda t: t.label == 'NP')] for tree in phrases]

flatten the list of lists of lists of tuples that we've ended up with, into just a list of lists of tuples

leaves = [tupls for sublists in leaves for tupls in sublists]

Join the extracted terms into one bigram

nounphrases = [unigram[0][1]+' '+unigram[1][0] in leaves]

Upvotes: 12

Answers (4)

NeuroMorphing

Reputation: 149

The Constituent-Treelib library, which can be installed via: pip install constituent-treelib does excatly what you are looking for in few lines of code. In order to extract noun (or any other) phrases, perform the following steps.

from constituent_treelib import ConstituentTree

# First, we have to provide a sentence that should be parsed
sentence = "I've got a machine learning task involving a large amount of text data."

# Then, we define the language that should be considered with respect to the underlying models 
language = ConstituentTree.Language.English

# You can also specify the desired model for the language ("Small" is selected by default)
spacy_model_size = ConstituentTree.SpacyModelSize.Medium

# Next, we must create the neccesary NLP pipeline. 
# If you wish, you can instruct the library to download and install the models automatically
nlp = ConstituentTree.create_pipeline(language, spacy_model_size) # , download_models=True

# Now, we can instantiate a ConstituentTree object and pass it the sentence and the NLP pipeline
tree = ConstituentTree(sentence, nlp)

# Finally, we can extract the phrases
tree.extract_all_phrases()

Result...

{'S': ["I 've got a machine learning task involving a large amount of text data ."],
 'PP': ['of text data'],
 'VP': ["'ve got a machine learning task involving a large amount of text data",
  'got a machine learning task involving a large amount of text data',
  'involving a large amount of text data'],
 'NML': ['machine learning'],
 'NP': ['a machine learning task involving a large amount of text data',
  'a machine learning task',
  'a large amount of text data',
  'a large amount',
  'text data']}

If you only want the noun phrases, just pick them out with tree.extract_all_phrases()['NP']

['a machine learning task involving a large amount of text data',
 'a machine learning task',
 'a large amount of text data',
 'a large amount',
 'text data']

Upvotes: 0

Saurabh Yadav

Reputation: 365

The above methods didn't give me the required results. Following is the function that I would suggest

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import re


def get_noun_phrases(text):
    pos = pos_tag(word_tokenize(text))
    count = 0
    half_chunk = ""
    for word, tag in pos:
        if re.match(r"NN.*", tag):
            count+=1
            if count>=1:
                half_chunk = half_chunk + word + " "
        else:
            half_chunk = half_chunk+"---"
            count = 0
    half_chunk = re.sub(r"-+","?",half_chunk).split("?")
    half_chunk = [x.strip() for x in half_chunk if x!=""]
    return half_chunk

Upvotes: 1

Aldorath

Reputation: 45

I suggest referring to this prior thread: Extracting all Nouns from a text file using nltk

They suggest using TextBlob as the easiest way to achieve this (if not the one that is most efficient in terms of processing) and the discussion there addresses your question.

from textblob import TextBlob
txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
blob = TextBlob(txt)
print(blob.noun_phrases)

Upvotes: 1

alvas

Reputation: 122168

Take a look at Why is my NLTK function slow when processing the DataFrame?, there's no need to iterate through all rows multiple times if you don't need intermediate steps.

With ne_chunk and solution from

[code]:

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue

    return continuous_chunk

df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.', 
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})

df['text'].apply(lambda sent: get_continuous_chunks((sent)))

[out]:

0                   [New York]
1    [Washington, Bruce Wayne]
Name: text, dtype: object

To use the custom RegexpParser :

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd

# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue

    return continuous_chunk


df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.', 
                           'Another bar foo Washington DC thingy with Bruce Wayne.']})


df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))

[out]:

0                  [bar sentence, New York city]
1    [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object

Upvotes: 13

Python (NLTK) - more efficient way to extract noun phrases?

Answers (4)

Related Questions