Reputation: 2775
I have a thousands of sentences about events happened in the past. E.g.
sentence1 = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
sentence2 = 'Alfonso VI of Castile captures the Moorish Muslim city of Toledo, Spain.'
sentence3 = 'The Hindu Medang kingdom flourishes and declines.'
I want to transform them into questions of the form:
question1 = 'When were the Knights Templar founded to protect Christian pilgrims in Jerusalem?'
question2 = 'When did Alfonso VI of Castile capture the Moorish Muslim city of Toledo, Spain?'
question3 = 'When did the Hindu Medang kingdom flourish and decline?'
I realize that this is a complex problem and I am ok with a success rate of 80%.
As far as I understand from searches on the web NTLK is the way to go for this kind of problems. I started to try some things but it is the first time I use this library and I cannot go much further than this:
import nltk
question = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
tokens = nltk.word_tokenize(question)
tagged = nltk.pos_tag(tokens)
This sounds like a problem many people must have encountered and solved. Any suggestions?
Upvotes: 2
Views: 2616
Reputation: 1281
NLTK can definitely be the right tool to use here. But the quality of your tokenizer and pos-tagger output depends on your corpus and type of sentences. Also, there is usually not really an out-of-the-box solution to this (afaik), and it requires some tuning. If you don't have very much time to put into this, I doubt that your success rate will even reach 80%.
Having said that; here's a basic list instertion based example that may help you to capture and succesfully convert some of your sentences.
import nltk
question_one = 'The Knights Templar are founded to protect Christian pilgrims in Jerusalem.'
question_two = 'Alfonso VI of Castile captures the Moorish Muslim city of Toledo, Spain.'
def modify(inputStr):
tokens = nltk.PunktWordTokenizer().tokenize(inputStr)
tagged = nltk.pos_tag(tokens)
auxiliary_verbs = [i for i, w in enumerate(tagged) if w[1] == 'VBP']
if auxiliary_verbs:
tagged.insert(0, tagged.pop(auxiliary_verbs[0]))
else:
tagged.insert(0, ('did', 'VBD'))
tagged.insert(0, ('When', 'WRB'))
return ' '.join([t[0] for t in tagged])
question_one = modify(question_one)
question_two = modify(question_two)
print(question_one)
print(question_two)
Output:
When are The Knights Templar founded to protect Christian pilgrims in Jerusalem.
When did Alfonso VI of Castile captures the Moorish Muslim city of Toledo , Spain.
As you can see, you'd still need to fix correct casing ('The' is still uppercase), 'captures' is in the wrong tense now and you will want to expand on auxiliary_verbs types (probably 'VBP' alone is too limited). But it's a start. Hope this helps!
Upvotes: 7