Apply function to multiple columns of a groupby object

Question

I have a dataframe that looks like -

   block  lat  lon
0      0  112   50
1      0  112   50
2      0  112   50
3      1  105   20
4      1  105   20
5      2  130   30

and I want to first groupby block and then apply a function to the lat lon columns. eg

df['location_id'] = df.groupby('block').apply(lambda x: get_location_id(x['lat'], x['lon'])

For each lat/lon my function returns an ID. I want a new column that has that ID. When I try above it doesn't work, and the lambda function seems to dislike me using axis=1 or similar. ie -

   block  lat  lon location_id
0      0  112   50  1
1      0  112   50  1
2      0  112   50  1
3      1  105   20  23
4      1  105   20  23
5      2  130   30  15

I'd like to avoid just applying the function to the ungrouped dataframe because my dataset is quite large and that will be slow.

edit: the function returns a string id. Takes in lat lon and returns a single string value.

miriess · Accepted Answer

Depending on how "large" your "large" Dataset is, there might be different solutions... And I'm not 100% certain you can (or should) do what you want with groupby. I'd suggest (and I'm relatively sure that it will work even in distributed environments) to do the following:

Create a new dataframe with non-duplicated block,lat,lon combinations.
Apply your function to that dataframe.
INNER join that dataframe to your original dataframe on those three columns. (While this feels bad, inner joins are usually done via hash joins and are usually quite fast even on spark clusters.)

Edit: Pandas is quite fine with handling datasets that take gigabytes in RAM as long as you have the ram, so just applying the functio might be a lot more viable than you think.

Apply function to multiple columns of a groupby object

Answers (1)

Related Questions