Reputation: 8815
I want to tag location strings in text using NLTK and also Stanford-NLP,
and I am looking for a dictionary-lookup tagger for NLTK/Stanford-NLP. So far I haven't found anything that uses a dictionary-lookup method.
One way is to use NLTK's RegexpTagger and supply every location string to it, but that might be slow.
I don't need any semantic analysis; I only need to tag locations based on my location dictionary.
Any ideas?
Upvotes: 4
Views: 1507
Reputation: 378
If all you need is to look up from dictionaries, then htql.RegEx() may be a good fit. Here is the example from http://htql.net:
import htql
address = '88-21 64th st , Rego Park , New York 11374'
states=['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut',
'Delaware', 'District Of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma',
'Oregon', 'PALAU', 'Pennsylvania', 'PUERTO RICO', 'Rhode Island', 'South Carolina', 'South Dakota',
'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin',
'Wyoming']
a = htql.RegEx()
a.setNameSet('states', states)
state_zip1 = a.reSearchStr(address, r"&[s:states][,\s]+\d{5}", case=False)[0]
# state_zip1 = 'New York 11374'
state_zip2 = a.reSearchList(address.split(), r"&[ws:states]<,>?<\d{5}>", case=False)[0]
# state_zip2 = ['New', 'York', '11374']
You can pass the parameter useindex=True to return the matching positions.
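For comparison, the same kind of dictionary lookup with match positions can be sketched with only the standard library's re module. This is not htql; the function name find_terms and the short term list are made up for illustration, and the approach assumes the dictionary is small enough to compile into a single alternation.

```python
import re

def find_terms(text, terms):
    """Return (start, end, matched_text) for every dictionary term found in text."""
    # Put longer entries first so that, at the same starting position,
    # a longer entry wins over a shorter one sharing its prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

address = '88-21 64th st , Rego Park , New York 11374'
matches = find_terms(address, ['New York', 'New Jersey', 'Virginia', 'West Virginia'])
print(matches)
```

Each tuple carries the start and end offsets, so address[start:end] gives back the matched text.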
Upvotes: 0
Reputation: 4771
You could use UnigramTagger:
#!/usr/bin/env python2
from nltk.tag.sequential import UnigramTagger
from nltk.tokenize import word_tokenize, sent_tokenize
text = 'I visited Paris and Bordeaux, but not Los Angeles'
# Training data: one "sentence" of (word, tag) pairs that acts as the dictionary.
locations = [[('Paris', 'LOC'), ('Bordeaux', 'LOC'), ('France', 'LOC'),
              ('Los Angeles', 'LOC')]]
location_tagger = UnigramTagger(locations)
for sentence in sent_tokenize(text):
    tokens = word_tokenize(sentence)
    print(location_tagger.tag(tokens))
Prints:
[('I', None), ('visited', None), ('Paris', 'LOC'), ('and', None),
('Bordeaux', 'LOC'), (',', None), ('but', None), ('not', None),
('Los', None), ('Angeles', None)]
You will need a better tokenizer if you want to tag multi-word locations like Los Angeles.
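One way around the tokenizer limitation is to match the dictionary against runs of tokens instead of single tokens. Below is a minimal sketch in plain Python (no NLTK), assuming a simple regex tokenizer and a greedy longest-match strategy; the function name tag_locations is hypothetical and not part of any library.

```python
import re

def tag_locations(tokens, locations, tag="LOC"):
    """Greedy longest-match lookup of (possibly multi-word) dictionary entries."""
    # Index entries as tuples of words, e.g. ('Los', 'Angeles').
    entries = {tuple(loc.split()) for loc in locations}
    max_len = max((len(e) for e in entries), default=1)
    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first so 'Los Angeles' is found
        # before its individual words are considered.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in entries:
                tagged.append((" ".join(candidate), tag))
                i += n
                break
        else:
            # No dictionary entry starts at this token.
            tagged.append((tokens[i], None))
            i += 1
    return tagged

text = 'I visited Paris and Bordeaux, but not Los Angeles'
# Simple tokenizer (an assumption): words, or single punctuation characters.
tokens = re.findall(r"\w+|[^\w\s]", text)
result = tag_locations(tokens, ['Paris', 'Bordeaux', 'France', 'Los Angeles'])
print(result)
```

Here 'Los' and 'Angeles' come back merged as a single ('Los Angeles', 'LOC') pair, while tokens outside the dictionary keep a None tag, as in the UnigramTagger output.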
Upvotes: 2