Reputation: 4329
I have a database with a "location" field that contains unconstrained user input in the form of a string. I would like to map each entry to either a US state or NULL.
For example:
'Southeastern Massachusetts' -> MA
'Brookhaven, NY' -> NY
'Manitowoc' -> WI
'Blue Springs, MO' -> MO
'A Damp & Cold Corner Of The World.' -> NULL
'Baltimore, Maryland' -> MD
'Indiana' -> IN
I can tolerate some errors, but fewer would obviously be better. What is the best way to go about this?
Upvotes: 0
Views: 32
Reputation: 4329
For posterity: I just threw a bunch of regexps at it, which worked 'pretty alright'.
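For anyone curious what that can look like, here is a minimal sketch of a regex-based pass in Python. It is not the exact code I used: the `STATES` table is deliberately partial, and `guess_state` is a name made up for this illustration. It accepts a two-letter code only after a trailing comma and otherwise looks for a full state name anywhere in the string.

```python
import re

# Partial table for illustration only; a real one would list every state.
STATES = {
    "Massachusetts": "MA",
    "New York": "NY",
    "Wisconsin": "WI",
    "Missouri": "MO",
    "Maryland": "MD",
    "Indiana": "IN",
}

NAME_TO_CODE = {name.lower(): code for name, code in STATES.items()}
CODES = set(STATES.values())

# Full state names, matched case-insensitively anywhere in the string.
# Longer names come first so a longer name wins over a shorter one it contains.
NAME_RE = re.compile(
    r"\b("
    + "|".join(re.escape(n) for n in sorted(STATES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

# Upper-case two-letter codes, accepted only at the end and after a comma,
# to avoid matching ordinary words such as "in" or "or".
ABBR_RE = re.compile(r",\s*([A-Z]{2})\s*$")


def guess_state(location):
    """Return a two-letter state code, or None if nothing matches."""
    m = ABBR_RE.search(location)
    if m and m.group(1) in CODES:
        return m.group(1)
    m = NAME_RE.search(location)
    if m:
        return NAME_TO_CODE[m.group(1).lower()]
    return None
```

Note that an entry like 'Manitowoc' comes back as None here, since nothing in the string names a state; that kind of input needs a list of place names.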
Upvotes: 0
Reputation: 750
You could use Geonames, which provides very large lists of location names with associated information and is free. Exact or approximate string matching against those lists should then not be too hard to implement in the simplest cases.
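As a rough sketch of the matching step, assuming you have already turned the place-name data into a dictionary: the `GAZETTEER` contents below are a tiny hand-written stand-in (one state per name, for simplicity), and `lookup_place` and the 0.85 cutoff are choices made for this example, not anything Geonames prescribes. The approximate matching uses `difflib` from the Python standard library.

```python
import difflib

# Hypothetical stand-in for a gazetteer built from a place-name list:
# lower-cased place name -> state code.
GAZETTEER = {
    "manitowoc": "WI",
    "blue springs": "MO",
    "baltimore": "MD",
}


def lookup_place(text, cutoff=0.85):
    """Return the state for a place name, tolerating small typos."""
    key = text.strip().lower()
    # Exact match first: cheap and unambiguous.
    if key in GAZETTEER:
        return GAZETTEER[key]
    # Otherwise take the single closest name, if it is similar enough.
    close = difflib.get_close_matches(key, list(GAZETTEER), n=1, cutoff=cutoff)
    return GAZETTEER[close[0]] if close else None
```

A high cutoff keeps false positives down at the cost of missing heavier misspellings; with a full-size name list, a linear scan like this may be too slow and an index would be needed.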
One difficulty you'll probably encounter is ambiguous names, i.e. names with multiple referents (e.g. Washington: is it the state or the city?). If several indicators are present, you can check that they are coherent with each other. Otherwise, you can look at the other words in the input, but this is probably risky.
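The coherence check could be sketched like this, under the assumption that each recognised indicator maps to the set of states it might refer to. The `CANDIDATES` table is made up for the example ("DC" stands in for the city of Washington), and `resolve` is a hypothetical helper that keeps a result only when all indicators agree on exactly one candidate.

```python
# Hypothetical candidate table: indicator -> set of places it could refer to.
CANDIDATES = {
    "washington": {"WA", "DC"},  # the state, or the city
    "seattle": {"WA"},
    "springfield": {"IL", "MA", "MO"},
    "mo": {"MO"},
}


def resolve(location):
    """Return a code only if all recognised indicators agree on one."""
    parts = [p.strip().lower() for p in location.split(",")]
    sets = [CANDIDATES[p] for p in parts if p in CANDIDATES]
    if not sets:
        return None  # nothing recognised
    common = set.intersection(*sets)
    # Ambiguous (several survivors) or contradictory (none): give up.
    return next(iter(common)) if len(common) == 1 else None
```

With this table, "Seattle, Washington" resolves to WA and "Springfield, MO" to MO, while a bare "Washington" stays unresolved because both readings remain possible.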
IMO, this is very close to Entity Linking, followed by a search for the closest state given the entities that have been linked.
Upvotes: 1