singhalc
singhalc

Reputation: 353

How to identify the subject of a sentence?

Can Python + NLTK be used to identify the subject of a sentence? From what I have learned till now is that a sentence can be broken into a head and its dependents. For e.g. "I shot an elephant". In this sentence, I and elephant are dependents to shot. But How do I discern that the subject in this sentence is I.

Upvotes: 22

Views: 38408

Answers (7)

Abd Hendi
Abd Hendi

Reputation: 1

code using spacy : here the doc is the sentence and dep='nsubj' for subject and 'dobj' for object

import spacy 
nlp = spacy.load('en_core_web_lg')


def get_subject_object_phrase(doc, dep):
    doc = nlp(doc)
    for token in doc:
        if dep in token.dep_:
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
    return str(doc[start:end])

Upvotes: 0

ccpizza
ccpizza

Reputation: 31686

rake_nltk (pip install rake_nltk) is a python library that wraps nltk and apparently uses the RAKE algorithm.

from rake_nltk import Rake

rake = Rake()

kw = rake.extract_keywords_from_text("Can Python + NLTK be used to identify the subject of a sentence?")

ranked_phrases = rake.get_ranked_phrases()

print(ranked_phrases)

# outputs the keywords ordered by rank
>>> ['used', 'subject', 'sentence', 'python', 'nltk', 'identify']

By default the stopword list from nltk is used. You can provide your custom stopword list and punctuation chars by passing them in the constructor:

rake = Rake(stopwords='mystopwords.txt', punctuations=''',;:!@#$%^*/\''')

By default string.punctuation is used for punctuation.

The constructor also accepts a language keyword which can be any language supported by nltk.

Upvotes: 1

krish___na
krish___na

Reputation: 722

Stanford Corenlp Tool can also be used to extract Subject-Relation-Object information of a sentence.

Attaching screenshot of same:

enter image description here

Upvotes: 0

cronos
cronos

Reputation: 1

You can paper over the issue by doing something like doc = nlp(text.decode('utf8')), but this will likely bring you more bugs in future.

Credits: https://github.com/explosion/spaCy/issues/380

Upvotes: -3

Sohel Khan
Sohel Khan

Reputation: 661

You can use Spacy.

Code

import spacy
nlp = spacy.load('en')
sent = "I shot an elephant"
doc=nlp(sent)

sub_toks = [tok for tok in doc if (tok.dep_ == "nsubj") ]

print(sub_toks) 

Upvotes: 31

rishi
rishi

Reputation: 2554

English language has two voices: Active voice and passive voice. Lets take most used voice: Active voice.

It follows subject-verb-object model. To mark the subject, write a rule set with POS tags. Tag the sentence I[NOUN] shot[VERB] an elephant[NOUN]. If you see the first noun is subject, then there is a verb and then there is an object.

If you want to make it more complicated, a sentence- I shot an elephant with a gun. Here the prepositions or subordinate conjunctions like with, at, in can be given roles. Here the sentence will be tagged as I[NOUN] shot[VERB] an elephant[NOUN] with[IN] a gun[NOUN]. You can easily say that word with gets instrumentative role. You can build a rule based system to get role of every word in the sentence.

Also look at the patterns in passive voice and write rules for the same.

Upvotes: 6

Nikita Astrakhantsev
Nikita Astrakhantsev

Reputation: 4749

As NLTK book (exercise 29) says, "One common way of defining the subject of a sentence S in English is as the noun phrase that is the child of S and the sibling of VP."

Look at tree example: indeed, "I" is the noun phrase that is the child of S that is the sibling of VP, while "elephant" is not.

Upvotes: 15

Related Questions