Victor Zuanazzi
Victor Zuanazzi

Reputation: 1974

Vectorized implementation of a table lookup loop SOLVED

I have two pandas dataframes, df_map has the data I am working with and df_4pc is a lookup table. I want to include the latitute and longitude in my df_map by using postcodes.

That is how they look:

>> df_4pc.head()
idx  postcode   woonplaats  latitude    longitude
0   1000    Amsterdam   52.336243   4.869444
1   1001    Amsterdam   52.364240   4.883358
2   1002    Amsterdam   52.364240   4.883358
3   1003    Amsterdam   52.364240   4.883358
4   1005    Amsterdam   52.364240   4.883358
>> df_map.head()
    location_postcode   latitude    longitude
visit_id            
12765996        1000    0.0          0.0
12764909        1005    0.0          0.0
11698683        1002    0.0          0.0
11665112        1003    0.0          0.0
12638508        1005    0.0          0.0

For the task, I tried using df.lookup, but I get a keyrror:

>> df_4pc.lookup(df_map["location_postcode"], ["postcode"]*len(df_map))

KeyError: 'One or more row labels was not found'

However, I made sure that only postcodes present in df_4pc are used in df_map (the other values were discarded). I have an implementation running, but the dataset is quite large and it takes a few hours to run the whole thing with my loop implementation:

for i in tqdm_notebook(df_map.index.tolist()):
    df_map.at[i, "latitude"] = df_4pc[df_4pc["postcode"] == df_map.at[i, "location_postcode"]]["latitude"]
    df_map.at[i, "longitude"] = df_4pc[df_4pc["postcode"] == df_map.at[i, "location_postcode"]]["longitude"]

That is my expected output:

>> df_map.head()
    location_postcode   latitude    longitude
visit_id            
12765996        1000    52.336243   4.869444
12764909        1005    52.364240   4.883358
11698683        1002    52.364240   4.883358
11665112        1003    52.364240   4.883358
12638508        1005    52.364240   4.883358

I am very unsure of why I have this key error. I looked on multiple threads of stack overflow, especially the ones related to df.lookup, but I could not find anything that worked for me.

The final question is: How can I get an efficient implementation for this lookup table?

Solution pandas join does the job in a second for a database of +1M lines.

coords = df_4pc.set_index('postcode')
df_map = df_map.loc[:, ['location_postcode'].copy()].join(coords, on='location_postcode')

Upvotes: 0

Views: 71

Answers (1)

ralex
ralex

Reputation: 388

You haven't provided your data as an MVCE so I can't verify but DataFrame.join should work.

coords = df_4pc.set_index('postcode')
df_map = (df_map.loc[:, ['location_postcode'].copy()
          .join(coords, on='location_postcode'))

Upvotes: 1

Related Questions