Lydon Ch

Reputation: 8815

Dictionary-based NLTK tagger

I want to tag location strings in text using NLTK and also Stanford NLP.

I am looking for a dictionary-lookup tagger for NLTK or Stanford NLP, but so far I haven't found anything that uses a dictionary-lookup method.

One way is to use NLTK's RegexpTagger and supply every location string as a pattern, but that might be slow.

I don't need any semantic analysis; I only need to tag the locations that appear in my location dictionary.

Any ideas?

Upvotes: 4

Views: 1507

Answers (2)

seagulf

Reputation: 378

If all you need is to look up from dictionaries, then htql.RegEx() may be a good fit. Here is the example from http://htql.net:

import htql; 
address = '88-21 64th st , Rego Park , New York 11374'
states=['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 
    'Delaware', 'District Of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 
    'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 
    'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 
    'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 
    'Oregon', 'PALAU', 'Pennsylvania', 'PUERTO RICO', 'Rhode Island', 'South Carolina', 'South Dakota', 
    'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 
    'Wyoming']; 

a=htql.RegEx(); 
a.setNameSet('states', states);

state_zip1=a.reSearchStr(address, r"&[s:states][,\s]+\d{5}", case=False)[0]; 
# state_zip1 = 'New York 11374'

state_zip2=a.reSearchList(address.split(), r"&[ws:states]<,>?<\d{5}>", case=False)[0]; 
# state_zip2 = ['New', 'York', '11374']

You can pass the parameter useindex=True to return the matching positions.

Upvotes: 0

Quentin Pradet

Reputation: 4771

You could use UnigramTagger:

#!/usr/bin/env python2

from nltk.tag.sequential import UnigramTagger
from nltk.tokenize import word_tokenize, sent_tokenize

text = 'I visited Paris and Bordeaux, but not Los Angeles'

locations = [[('Paris', 'LOC'), ('Bordeaux', 'LOC'), ('France', 'LOC'),
              ('Los Angeles', 'LOC')]]    
location_tagger = UnigramTagger(locations)

for sentence in sent_tokenize(text):
    tokens = word_tokenize(sentence)
    print(location_tagger.tag(tokens))

Prints:

[('I', None), ('visited', None), ('Paris', 'LOC'), ('and', None),
 ('Bordeaux', 'LOC'), (',', None), ('but', None), ('not', None),
 ('Los', None), ('Angeles', None)]

You will need a better tokenizer if you want to tag multi-word locations like Los Angeles.
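
One way around the tokenizer limitation is to match the dictionary against runs of tokens instead of single tokens. The following is a minimal sketch in plain Python 3, not an NLTK or Stanford NLP API: the function name tag_locations and its parameters are made up for illustration, and it assumes dictionary entries can be split on whitespace and compared to tokens exactly (case-sensitive).

```python
def tag_locations(tokens, locations, tag="LOC"):
    """Greedy longest-match dictionary tagger (hypothetical helper, not part of NLTK)."""
    # Store each dictionary entry as a tuple of its whitespace-separated words.
    entries = {tuple(name.split()) for name in locations}
    max_len = max((len(entry) for entry in entries), default=0)

    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in entries:
                tagged.append((" ".join(candidate), tag))
                i += n
                break
        else:
            # No dictionary entry starts at this token.
            tagged.append((tokens[i], None))
            i += 1
    return tagged


tokens = ['I', 'visited', 'Paris', 'and', 'Bordeaux', ',',
          'but', 'not', 'Los', 'Angeles']
result = tag_locations(tokens, ['Paris', 'Bordeaux', 'France', 'Los Angeles'])
print(result)
```

Each lookup is a set membership test, so the cost per token depends on the longest entry in the dictionary rather than on the number of entries. The token list can come from word_tokenize as in the code above.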

Upvotes: 2
