adamcircle

Reputation: 724

How should I go about geolocating 1,100,000 lines of coordinate information?

Okay, so I'm trying to envision a solution for this. I have a database with over a million rows, each containing a US city name and a set of coordinates for that city. The problem is that multiple cities share the same name (Springfield, NJ and Springfield, MA, for example), so I need to get the state for each row.

There are also duplicates within the data. There are only about 6500 unique sets of coordinates, so conceivably I could look up the state for just those and then assign the results to the other entries in the database. Is this a feasible plan? How would I go about this?

Here are some examples of what entries in this database look like:

2015-09-01 00:00:00,Buffalo,"42.9405299,-78.8697906",10.1016/s0894-7317(12)00840-1,42.9405299,-78.8697906,43.0,-79.0
2015-09-01 00:00:00,New York,"40.7830603,-73.9712488",10.1016/j.jmv.2014.04.008,40.783060299999995,-73.9712488,41.0,-74.0
2015-09-01 00:00:04,Scottsdale,"33.4941704,-111.9260519",10.1016/j.euroneuro.2014.05.008,33.494170399999994,-111.9260519,33.0,-112.0
2015-09-01 00:00:09,Provo,"40.2338438,-111.6585337",10.1016/j.toxac.2014.07.002,40.233843799999995,-111.6585337,40.0,-112.0
2015-09-01 00:00:13,New York,"40.7830603,-73.9712488",10.1016/j.drugalcdep.2014.09.015,40.783060299999995,-73.9712488,41.0,-74.0
2015-09-01 00:00:16,Fremont,"37.5482697,-121.9885719",10.1016/j.ajic.2012.04.160,37.548269700000006,-121.98857190000001,38.0,-122.0
2015-09-01 00:00:24,Provo,"40.2338438,-111.6585337",10.1016/j.chroma.2015.01.036,40.233843799999995,-111.6585337,40.0,-112.0

I am using the geocoder package for the reverse geocoding. Here is some code I've written that could handle a single lookup:

import geocoder

def convert_to_state(lati, long):
    # Reverse-geocode one coordinate pair into (full state name, abbreviation).
    lat, lon = float(lati), float(long)
    g = geocoder.google([lat, lon], method='reverse')
    return g.state_long, g.state

I'm just not sure how to do this. Turns out geocoding is pretty expensive, so using the duplicates is probably the best way forward. Any suggestions for how to accomplish that?

Upvotes: 1

Views: 95

Answers (3)

Joseph Hansen

Reputation: 13329

SmartyStreets, a geo-info service, has a list tool that processes a batch of searches and returns a bunch of information for each one (you can upload a spreadsheet or copy and paste). They focus on address validation, so they expect the search terms to be addresses, but the tool can also match bare zip codes to cities and states. Do you have access to that info?

Here's a link to the demo.

Upvotes: 1

SO44

Reputation: 1329

I like the hash table idea, but here is an alternative using some pandas stuff:

1) get a unique list of (lat, lon) coords

df['latlon'] = list(zip(df['lati'], df['long']))
unique_ll = df['latlon'].unique()

2) loop through unique coords and set the state for all equivalent lines

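# Geocode each unique pair once, then look up the short state code for every row.
states = {ll: convert_to_state(ll[0], ll[1])[1] for ll in unique_ll}
df['state'] = [states[ll] for ll in df['latlon']]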
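For what it's worth, the same two steps can also be written as a drop_duplicates plus a merge. This is only a rough sketch: the column names are my own guesses (the sample rows in the question have no header), and fake_lookup is a made-up stand-in so it runs without hitting the geocoding service. In real use you would pass something like lambda lat, lon: convert_to_state(lat, lon)[1] instead.

```python
import io
import pandas as pd

# Assumed column names; the sample rows in the question have no header line.
COLUMNS = ["timestamp", "city", "latlon", "doi", "lati", "long", "lat_round", "lon_round"]

def add_state_column(df, lookup):
    """Call lookup once per unique (lati, long) pair and join the result back."""
    unique = df[["lati", "long"]].drop_duplicates().copy()
    unique["state"] = [lookup(lat, lon) for lat, lon in zip(unique["lati"], unique["long"])]
    # A left merge keeps every original row, in its original order.
    return df.merge(unique, on=["lati", "long"], how="left")

# Hypothetical stand-in for the real reverse geocoder, for offline testing only.
def fake_lookup(lat, lon):
    return "NY" if lon > -100 else "UT"

sample = io.StringIO("""\
2015-09-01 00:00:00,Buffalo,"42.9405299,-78.8697906",10.1016/s0894-7317(12)00840-1,42.9405299,-78.8697906,43.0,-79.0
2015-09-01 00:00:00,New York,"40.7830603,-73.9712488",10.1016/j.jmv.2014.04.008,40.783060299999995,-73.9712488,41.0,-74.0
2015-09-01 00:00:09,Provo,"40.2338438,-111.6585337",10.1016/j.toxac.2014.07.002,40.233843799999995,-111.6585337,40.0,-112.0
2015-09-01 00:00:13,New York,"40.7830603,-73.9712488",10.1016/j.drugalcdep.2014.09.015,40.783060299999995,-73.9712488,41.0,-74.0
2015-09-01 00:00:24,Provo,"40.2338438,-111.6585337",10.1016/j.chroma.2015.01.036,40.233843799999995,-111.6585337,40.0,-112.0
""")
df = pd.read_csv(sample, header=None, names=COLUMNS)
out = add_state_column(df, fake_lookup)
```

The merge matches on exact float equality, which should be fine as long as duplicate rows carry identical coordinate values, as they appear to in the sample.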

Upvotes: 2

James

Reputation: 2731

Almost certainly the best way to avoid doing extra work will be to use a hash table to check if something already had a mapping:

import geocoder

processed_coords = {}  # (lat, lon) -> (state_long, state_short)

def convert_to_state(lati, long):
    lat, lon = float(lati), float(long)
    if (lat, lon) not in processed_coords:
        # First time we see this pair: call the geocoder and remember the result.
        g = geocoder.google([lat, lon], method='reverse')
        processed_coords[(lat, lon)] = (g.state_long, g.state)
    return processed_coords[(lat, lon)]

This way you do a simple O(1) check to see if you already have the data, which isn't much extra calculation at all, and you don't redo the work if you indeed have already done it.

If you're right that there are only about 6500 unique sets of coordinates, memory usage will be fine with this technique. If you're wrong and many more of those million-odd rows turn out to be unique, the dictionary grows with them and you may run into memory issues.
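To show how the cache plays out over a whole file, here is a rough sketch that streams rows with the csv module and only calls the lookup for coordinate pairs it hasn't seen. It assumes the layout from the sample rows, where the third field is the quoted "lat,lon" string, and uses that string as the cache key. add_states and fake_lookup are names I made up for illustration; fake_lookup stands in for convert_to_state so the example runs offline.

```python
import csv
import io

def add_states(rows, lookup):
    """Yield each row with state info appended, calling lookup once per unique pair."""
    cache = {}
    for row in rows:
        key = row[2]  # the quoted "lat,lon" field, used verbatim as the cache key
        if key not in cache:
            lat, lon = (float(part) for part in key.split(","))
            cache[key] = lookup(lat, lon)
        yield row + list(cache[key])

# Hypothetical stand-in for the real reverse geocoder; it records each call it gets.
calls = []
def fake_lookup(lat, lon):
    calls.append((lat, lon))
    return ("New York", "NY") if lon > -100 else ("Utah", "UT")

sample = io.StringIO("""\
2015-09-01 00:00:00,Buffalo,"42.9405299,-78.8697906",10.1016/s0894-7317(12)00840-1,42.9405299,-78.8697906,43.0,-79.0
2015-09-01 00:00:00,New York,"40.7830603,-73.9712488",10.1016/j.jmv.2014.04.008,40.783060299999995,-73.9712488,41.0,-74.0
2015-09-01 00:00:09,Provo,"40.2338438,-111.6585337",10.1016/j.toxac.2014.07.002,40.233843799999995,-111.6585337,40.0,-112.0
2015-09-01 00:00:13,New York,"40.7830603,-73.9712488",10.1016/j.drugalcdep.2014.09.015,40.783060299999995,-73.9712488,41.0,-74.0
2015-09-01 00:00:24,Provo,"40.2338438,-111.6585337",10.1016/j.chroma.2015.01.036,40.233843799999995,-111.6585337,40.0,-112.0
""")
result = list(add_states(csv.reader(sample), fake_lookup))
```

Here the five rows contain three distinct coordinate pairs, so the lookup runs three times. Because rows are processed one at a time, the only thing that grows in memory is the cache itself, one entry per unique pair.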

Upvotes: 3
