Reputation: 841
I am attempting to extract this type of information from the following paragraph structure:
women_ran  men_ran  kids_ran  walked
1          2        1         3
2          4        3         1
3          6        5         2
text = ["On Tuesday, one women ran on the street while 2 men ran and 1 child ran on the sidewalk. Also, there were 3 people walking.", "One person was walking yesterday, but there were 2 women running as well as 4 men and 3 kids running.", "The other day, there were three women running and also 6 men and 5 kids running on the sidewalk. Also, there were 2 people walking in the park."]
I am using Python's spaCy as my NLP library. I am new to NLP work and am hoping for some guidance on the best way to extract this tabular information from such sentences.
If it were simply a matter of identifying whether there were individuals running or walking, I would just use sklearn to fit a classification model, but the information I need to extract is more granular than that: I am trying to retrieve the subcategories and a value for each. Any guidance would be greatly appreciated.
Upvotes: 9
Views: 5437
Reputation: 4287
You'll want to use the dependency parse for this. You can see a visualisation of your example sentence using the displaCy visualiser.
You could implement the rules you need a few different ways — much like how there are always multiple ways to write an XPath query, DOM selector, etc.
Something like this should work:
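import spacy

# 'en_core_web_sm' is the small English model; spaCy 2.x also accepted the shortcut 'en'.
nlp = spacy.load('en_core_web_sm')
# `text` is the list of paragraphs from the question.
docs = [nlp(t) for t in text]

for i, doc in enumerate(docs):
    for j, sent in enumerate(doc.sents):
        # Nominal subjects of the sentence, e.g. "men" in "2 men ran".
        subjects = [w for w in sent if w.dep_ == 'nsubj']
        for subject in subjects:
            # Numeric modifiers attached to the left of the subject.
            numbers = [w for w in subject.lefts if w.dep_ == 'nummod']
            if len(numbers) == 1:
                print('document.sentence: {}.{}, subject: {}, action: {}, numbers: {}'.format(i, j, subject.text, subject.head.text, numbers[0].text))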
For your examples in text you should get:
document.sentence: 0.0, subject: men, action: ran, numbers: 2
document.sentence: 0.0, subject: child, action: ran, numbers: 1
document.sentence: 0.1, subject: people, action: walking, numbers: 3
document.sentence: 1.0, subject: person, action: walking, numbers: One
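To get from these printed records to the table in the question, you still need to normalise number words such as "One" and map each subject/action pair to a column. Here is a minimal sketch of that step in plain Python. The lookup tables (NUMBER_WORDS, SUBJECT_COLUMNS, WALK_ACTIONS) and the helper names are hypothetical, chosen for illustration, and the records are simply the four lines of output above typed in by hand rather than read from spaCy.

```python
# Records in the shape printed above: (document index, subject, action, number).
# These four are copied from the sample output; nothing else is assumed.
records = [
    (0, "men", "ran", "2"),
    (0, "child", "ran", "1"),
    (0, "people", "walking", "3"),
    (1, "person", "walking", "One"),
]

# Hypothetical lookup tables; extend them to suit your own data.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
SUBJECT_COLUMNS = {"woman": "women_ran", "women": "women_ran",
                   "man": "men_ran", "men": "men_ran",
                   "child": "kids_ran", "children": "kids_ran",
                   "kid": "kids_ran", "kids": "kids_ran"}
WALK_ACTIONS = {"walk", "walked", "walking"}
COLUMNS = ["women_ran", "men_ran", "kids_ran", "walked"]


def to_int(token_text):
    """Convert '2' or 'One' to an int; return None if unrecognised."""
    lowered = token_text.lower()
    if lowered.isdigit():
        return int(lowered)
    return NUMBER_WORDS.get(lowered)


def column_for(subject, action):
    """Pick the output column: any walking action wins, else go by subject."""
    if action.lower() in WALK_ACTIONS:
        return "walked"
    return SUBJECT_COLUMNS.get(subject.lower())


def build_table(records):
    """Group records by document index into one row of counts per document."""
    table = {}
    for doc_index, subject, action, number in records:
        column = column_for(subject, action)
        value = to_int(number)
        if column is None or value is None:
            continue  # skip anything the lookup tables do not cover
        row = table.setdefault(doc_index, dict.fromkeys(COLUMNS))
        row[column] = value
    return table


table = build_table(records)
for doc_index in sorted(table):
    print(doc_index, table[doc_index])
```

Cells stay None where the rules produced no record, which is the case for the women counts in the output above, so the gaps show you which sentence patterns still need a rule.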
Upvotes: 13