Gioelelm

Reputation: 2775

Transforming statements into interrogative sentences with Python NLTK

I have thousands of sentences about events that happened in the past, e.g.:

sentence1 = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
sentence2 = 'Alfonso VI of Castile captures the Moorish Muslim city of Toledo, Spain.'
sentence3 = 'The Hindu Medang kingdom flourishes and declines.'

I want to transform them into questions of the form:

question1 = 'When were the Knights Templar founded to protect Christian pilgrims in Jerusalem?'
question2 = 'When did Alfonso VI of Castile capture the Moorish Muslim city of Toledo, Spain?'
question3 = 'When did the Hindu Medang kingdom flourish and decline?'

I realize that this is a complex problem, and I am OK with a success rate of 80%.

As far as I understand from searching the web, NLTK is the way to go for this kind of problem. I have started trying a few things, but this is the first time I have used this library and I cannot get much further than this:

import nltk
question = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
tokens = nltk.word_tokenize(question)
tagged = nltk.pos_tag(tokens)

This sounds like a problem many people must have encountered and solved. Any suggestions?

Upvotes: 2

Views: 2616

Answers (1)

Igor

Reputation: 1281

NLTK can definitely be the right tool to use here. But the quality of your tokenizer and POS tagger output depends on your corpus and the type of sentences. Also, there is usually no out-of-the-box solution for this (AFAIK), so it requires some tuning. If you don't have much time to put into this, I doubt that your success rate will even reach 80%.

Having said that, here's a basic example based on list insertion that may help you capture and successfully convert some of your sentences.

import nltk

question_one = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
question_two = 'Alfonso VI of Castile captures the Moorish Muslim city of Toledo, Spain.'

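def modify(inputStr):
    # Note: PunktWordTokenizer was removed in NLTK 3; on newer versions use
    # nltk.word_tokenize(inputStr), which also splits the final period off as its own token.
    tokens = nltk.PunktWordTokenizer().tokenize(inputStr)
    tagged = nltk.pos_tag(tokens)
    # Indices of tokens tagged VBP (non-3rd-person present verb), e.g. 'are'
    auxiliary_verbs = [i for i, w in enumerate(tagged) if w[1] == 'VBP']
    if auxiliary_verbs:
        # Move the first such verb to the front of the sentence
        tagged.insert(0, tagged.pop(auxiliary_verbs[0]))
    else:
        # No VBP found: fall back to the auxiliary 'did'
        tagged.insert(0, ('did', 'VBD'))
    tagged.insert(0, ('When', 'WRB'))

    # Rebuild the sentence from the token strings
    return ' '.join([t[0] for t in tagged])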

question_one = modify(question_one)
question_two = modify(question_two)

print(question_one)
print(question_two)

Output:

When are The Knights Templar founded to protect Christian pilgrims in Jerusalem.
When did Alfonso VI of Castile captures the Moorish Muslim city of Toledo , Spain.

As you can see, you'd still need to fix the casing ('The' is still uppercase), the tense ('are' should become 'were', and 'captures' should be 'capture' after 'did') and the final period, which should be a question mark. You will also want to extend the set of auxiliary verb tags (matching 'VBP' alone is probably too limited). But it's a start. Hope this helps!
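A rough sketch of how those remaining fixes (casing, tense, final punctuation) could look. It works on a list of (word, tag) pairs like the one nltk.pos_tag returns, so it runs without NLTK; the sample tags below are hand-written assumptions, not real tagger output. The names base_form, to_when_question, PAST_OF_BE and PRESENT_VERB_TAGS are made up for this illustration, and the de-inflection is deliberately naive (irregular verbs are not handled).

```python
# Hypothetical helpers; input is a list of (word, tag) pairs in Penn Treebank style.
PAST_OF_BE = {'is': 'was', 'are': 'were'}   # present forms of "be" -> past forms
PRESENT_VERB_TAGS = {'VBZ', 'VBP'}          # present-tense verb tags


def base_form(verb):
    """Naively strip the present-tense ending: 'captures' -> 'capture'."""
    if verb.endswith('ies') and len(verb) > 4:
        return verb[:-3] + 'y'
    if verb.endswith(('shes', 'ches', 'sses', 'xes', 'zes')):
        return verb[:-2]
    if verb.endswith('s') and not verb.endswith('ss'):
        return verb[:-1]
    return verb


def to_when_question(tagged):
    tokens = [list(pair) for pair in tagged]  # mutable copy of the input
    # Drop the sentence-final period; a question mark is appended at the end.
    if tokens and tokens[-1][0] == '.':
        tokens.pop()
    # Lowercase a leading determiner ('The' -> 'the'); proper nouns are left alone.
    if tokens and tokens[0][1] == 'DT':
        tokens[0][0] = tokens[0][0].lower()
    # Look for the first present-tense form of "be".
    be_index = next(
        (i for i, (word, tag) in enumerate(tokens)
         if tag in PRESENT_VERB_TAGS and word.lower() in PAST_OF_BE),
        None,
    )
    if be_index is not None:
        # Move it to the front in the past tense: 'are' -> 'were'.
        word, _ = tokens.pop(be_index)
        front = PAST_OF_BE[word.lower()]
    else:
        # Otherwise use 'did' and reduce present-tense verbs to their base form.
        front = 'did'
        for token in tokens:
            if token[1] in PRESENT_VERB_TAGS:
                token[0] = base_form(token[0])
    text = ' '.join(['When', front] + [word for word, _ in tokens]) + '?'
    return text.replace(' ,', ',')  # re-attach commas split off by the tokenizer


# Hand-tagged samples (assumed tags, written to resemble tagger output).
sample_one = [('The', 'DT'), ('Knights', 'NNPS'), ('Templar', 'NNP'), ('are', 'VBP'),
              ('founded', 'VBN'), ('to', 'TO'), ('protect', 'VB'), ('Christian', 'JJ'),
              ('pilgrims', 'NNS'), ('in', 'IN'), ('Jerusalem', 'NNP'), ('.', '.')]
sample_two = [('Alfonso', 'NNP'), ('VI', 'NNP'), ('of', 'IN'), ('Castile', 'NNP'),
              ('captures', 'VBZ'), ('the', 'DT'), ('Moorish', 'JJ'), ('Muslim', 'NNP'),
              ('city', 'NN'), ('of', 'IN'), ('Toledo', 'NNP'), (',', ','),
              ('Spain', 'NNP'), ('.', '.')]

print(to_when_question(sample_one))
print(to_when_question(sample_two))
```

With real tagger output the tags will sometimes differ from these hand-written ones (a verb tagged as a noun, for instance), which is where most of the tuning effort would go. For proper de-inflection a lemmatizer would probably do better than the suffix rules used here.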

Upvotes: 7
