dev.e.loper

Reputation: 36064

Extracting specific information from text

I'd like to extract some data from a text file. I've decided to do it using the Natural Language Toolkit (NLTK), but I'm open to suggestions if there is a better way to do this.

Here is an example:

I need a flight from New York NY to San Francisco CA.

From this text, I'd like to get the city and state for both the origin and the destination.

Here is what I have so far:

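import nltk
from nltk.text import Text
from nltk.corpus import PlaintextCorpusReader

def readfiles():
    # Raw string so the backslashes in the Windows path are kept literally
    corpus_root = r'C:\prototype\emails'
    w = PlaintextCorpusReader(corpus_root, '.*')
    t = Text(w.words())
    # concordance() prints its matches itself and returns None,
    # so it is called directly rather than passed to print
    print("--- to ----")
    t.concordance("to")

    print("--- from ----")
    t.concordance("from")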

I can read the text from some input (a file, in my case) and then use the concordance method to find every usage of a word. I want to extract the city and state information that comes after 'to' and 'from'.

What is the best way to look at the text that follows each instance of 'to' and 'from'?

Upvotes: 2

Views: 621

Answers (1)

Brian

Reputation: 23

Perhaps you'd be better off reading the file in line by line?
Then something as simple as this would do:

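# dataAfterTo holds the text that follows "to" on the line and is
# assumed to look like "San Francisco, CA" (comma between city and state)
cityState = dataAfterTo.split(",")
city = cityState[0].strip()
state = cityState[1].split()[0]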

Unless you're dealing with user-generated content, of course, where the format can't be relied on.
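Note that the example sentence in the question has no comma between city and state, so split(",") would not separate them there. One way around that is a regular expression applied to each line. The following is only a rough sketch: the pattern ROUTE and the helper extract_route are made-up names, and it assumes a lowercase "from ... to ..." phrasing, capitalized city names, and two-letter uppercase state codes. Real user input will likely need something more forgiving.

```python
import re

# Hypothetical pattern (an assumption, not taken from the question):
# "from <City> <ST> to <City> <ST>", with an optional comma before the state.
ROUTE = re.compile(
    r"\bfrom\s+(?P<from_city>[A-Z][A-Za-z .'-]*?),?\s+(?P<from_state>[A-Z]{2})\b"
    r"\s+to\s+(?P<to_city>[A-Z][A-Za-z .'-]*?),?\s+(?P<to_state>[A-Z]{2})\b"
)

def extract_route(line):
    """Return ((city, state), (city, state)) for origin and destination, or None."""
    m = ROUTE.search(line)
    if m is None:
        return None
    return ((m.group("from_city"), m.group("from_state")),
            (m.group("to_city"), m.group("to_state")))

# In practice the lines would come from iterating over an open file.
for line in ["I need a flight from New York NY to San Francisco CA."]:
    print(extract_route(line))
# prints (('New York', 'NY'), ('San Francisco', 'CA'))
```

The city part is matched lazily, so it stops at the first two-letter uppercase word, which is what lets "New York NY" split into "New York" and "NY".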

Upvotes: 1
