Dance Party2

Reputation: 7536

Pandas Groupby with Lambda and Algorithm

Given this data frame:

import pandas as pd
import jenkspy
f = pd.DataFrame({'BreakGroup':['A','A','A','A','A','A','B','B','B','B','B'],
                 'Final':[1,2,3,4,5,6,10,20,30,40,50]})
    BreakGroup  Final
0         A     1
1         A     2
2         A     3
3         A     4
4         A     5
5         A     6
6         B     10
7         B     20
8         B     30
9         B     40
10        B     50

Within each "BreakGroup", I'd like to use jenkspy to compute natural breaks for 4 classes and then identify the class to which each value in "Final" belongs.

I started out by doing this:

jenks=lambda x: jenkspy.jenks_breaks(f['Final'].tolist(),nb_class=4)
f['Group']=f.groupby(['BreakGroup'])['BreakGroup'].transform(jenks)

...which results in:

BreakGroup
A    [1.0, 10.0, 20.0, 30.0, 50.0]
B    [1.0, 10.0, 20.0, 30.0, 50.0]
Name: BreakGroup, dtype: object

The first problem here, as you may well have surmised, is that it applies the lambda function to the whole column of "Final" scores instead of just those belonging to each group in the Groupby. The second problem is that I need a column designating the correct group (class) membership, presumably by using transform instead of apply.

I then tried this:

jenks=lambda x: jenkspy.jenks_breaks(f['Final'].loc[f['BreakGroup']==x].tolist(),nb_class=4)
f['Group']=f.groupby(['BreakGroup'])['BreakGroup'].transform(jenks)

...but was promptly beaten back into submission:

ValueError: Can only compare identically-labeled Series objects

Update:

Here is the desired result. The "Result" column contains the upper limit of the group for the respective value from "Final" per group "BreakGroup":

    BreakGroup  Final   Result
0             A     1   2
1             A     2   3
2             A     3   4
3             A     4   4
4             A     5   6
5             A     6   6
6             B     10  20
7             B     20  30
8             B     30  40
9             B     40  50
10            B     50  50

Thanks in advance!

My slightly modified version, based on the accepted solution:

f.sort_values('BreakGroup',inplace=True)
f.reset_index(drop=True,inplace=True)
jenks = lambda x: jenkspy.jenks_breaks(x['Final'].tolist(),nb_class=4)
g = f.set_index('BreakGroup')
g['Groups'] = f.groupby(['BreakGroup']).apply(jenks)
g.reset_index(inplace=True)
groups= lambda x: [gp for gp in x['Groups']]
#'final' value should be > lower and <= upper
upper = lambda x: [gp for gp in x['Groups'] if gp >= x['Final']][0] # or gp == max(x['Groups'])
lower= lambda x: [gp for gp in x['Groups'] if gp < x['Final'] or gp == min(x['Groups'])][-1]
GroupIndex= lambda x: [x['Groups'].index(gp) for gp in x['Groups'] if gp < x['Final'] or gp == min(x['Groups'])][-1]
f['Groups']=g.apply(groups, axis=1)
f['Upper'] = g.apply(upper, axis=1)
f['Lower'] = g.apply(lower, axis=1)
f['Group'] = g.apply(GroupIndex, axis=1)
f['Group']=f['Group']+1

This returns:

  1. The list of group boundaries

  2. The upper boundary pertinent to the value for "Final"

  3. The lower boundary pertinent to the value for "Final"

  4. The group to which the value for "Final" will belong based on logic noted in comments.
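For what it's worth, the class index from item 4 can probably also be obtained with pd.cut, which bins values into right-closed intervals and, with include_lowest=True, puts the minimum into the first class. This is only a sketch: the breaks are hardcoded copies of the jenks output shown in the accepted answer (an assumption, to avoid depending on jenkspy here), and the breaks dict and the parts list are illustrative names.

```python
import pandas as pd

f = pd.DataFrame({'BreakGroup': list('AAAAAABBBBB'),
                  'Final': [1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50]})

# Assumption: breaks copied from the jenks output in the accepted answer,
# rather than computed with jenkspy.
breaks = {'A': [1.0, 2.0, 3.0, 4.0, 6.0],
          'B': [10.0, 20.0, 30.0, 40.0, 50.0]}

# Bin each group's values with that group's own breaks.
# labels=False returns 0-based bin numbers, so add 1 for 1-based classes.
parts = [
    pd.cut(grp['Final'], bins=breaks[key], labels=False, include_lowest=True) + 1
    for key, grp in f.groupby('BreakGroup')
]

# The pieces keep their original row labels, so assignment aligns by index.
f['Group'] = pd.concat(parts)
print(f)
```

Note that pd.cut needs strictly increasing bin edges, so this would have to be adapted if a group produced duplicate breaks.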

Upvotes: 3

Views: 1740

Answers (2)

EFT

Reputation: 2369

Your jenks lambda never uses its argument x. It always reads the whole f['Final'] column, so it returns the same result no matter what apply or transform passes in. Changing the definition of jenks to

jenks = lambda x: jenkspy.jenks_breaks(x['Final'].tolist(),nb_class=4)

gives

In [315]: f.groupby(['BreakGroup']).apply(jenks)
Out[315]: 
BreakGroup
A         [1.0, 2.0, 3.0, 4.0, 6.0]
B    [10.0, 20.0, 30.0, 40.0, 50.0]
dtype: object
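The difference is easy to see without jenkspy. In this small sketch, max() stands in for jenks_breaks (an assumption, just to keep it dependency-free); the first lambda ignores its argument and the second one uses it.

```python
import pandas as pd

f = pd.DataFrame({'BreakGroup': list('AAAAAABBBBB'),
                  'Final': [1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50]})

grouped = f.groupby('BreakGroup')['Final']

# Ignores its argument s and always reads the whole column,
# so every group gets the same answer.
ignores_arg = grouped.apply(lambda s: f['Final'].max())

# Uses its argument, so it only sees the rows of the current group.
uses_arg = grouped.apply(lambda s: s.max())

print(ignores_arg.to_dict())  # {'A': 50, 'B': 50}
print(uses_arg.to_dict())     # {'A': 6, 'B': 50}
```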

Continuing from this redefinition,

g = f.set_index('BreakGroup')
g['Groups'] = f.groupby(['BreakGroup']).apply(jenks)
g.reset_index(inplace=True)
group = lambda x: [gp for gp in x['Groups'] if gp > x['Final'] or gp == max(x['Groups'])][0]
f['Result'] = g.apply(group, axis=1)

gives

In [323]: f
Out[323]: 
   BreakGroup  Final  Result
0           A      1     2.0
1           A      2     3.0
2           A      3     4.0
3           A      4     6.0
4           A      5     6.0
5           A      6     6.0
6           B     10    20.0
7           B     20    30.0
8           B     30    40.0
9           B     40    50.0
10          B     50    50.0
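The lookup done by the group lambda (first break strictly above the value, falling back to the largest break) can also be written as a binary search. A minimal sketch, assuming the breaks are sorted in ascending order as in the output above; upper_break is a hypothetical helper name, not part of jenkspy or pandas.

```python
from bisect import bisect_right

def upper_break(breaks, value):
    """Return the first break strictly greater than value, or the last break."""
    # bisect_right gives the position of the first element > value.
    i = bisect_right(breaks, value)
    # Clamp so that the group maximum maps to the last break.
    return breaks[min(i, len(breaks) - 1)]

breaks_a = [1.0, 2.0, 3.0, 4.0, 6.0]
breaks_b = [10.0, 20.0, 30.0, 40.0, 50.0]

result_a = [upper_break(breaks_a, v) for v in [1, 2, 3, 4, 5, 6]]
result_b = [upper_break(breaks_b, v) for v in [10, 20, 30, 40, 50]]

print(result_a)  # [2.0, 3.0, 4.0, 6.0, 6.0, 6.0]
print(result_b)  # [20.0, 30.0, 40.0, 50.0, 50.0]
```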

Upvotes: 3

Parfait

Reputation: 107652

Currently, transform() passes a Series to your lambda, not the scalar that your filter condition expects. Take the first value instead, such as x.iloc[0], since all values are the same within a groupby Series. min(x) or max(x) would work as well:

jenks = lambda x: jenkspy.jenks_breaks(f['Final'].loc[f['BreakGroup']==x.iloc[0]].tolist(), nb_class=4)

f['Group'] = f.groupby(['BreakGroup'])['BreakGroup'].transform(jenks)
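To illustrate where the ValueError from the question comes from, here is a small sketch that leaves jenkspy out entirely. It compares the full column first against a group-sized Series (roughly what the lambda receives) and then against a single value taken with .iloc[0]. The names group_a, raised, mask and finals_a are only illustrative.

```python
import pandas as pd

f = pd.DataFrame({'BreakGroup': list('AAAAAABBBBB'),
                  'Final': [1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50]})

# Roughly what the lambda receives for group A: a 6-row Series of 'A' values.
group_a = f.loc[f['BreakGroup'] == 'A', 'BreakGroup']

# Series == Series requires identical labels, so this comparison fails.
try:
    f['BreakGroup'] == group_a
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Series == scalar broadcasts and yields a boolean mask over all rows.
mask = f['BreakGroup'] == group_a.iloc[0]
finals_a = f.loc[mask, 'Final'].tolist()
print(finals_a)  # [1, 2, 3, 4, 5, 6]
```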

Upvotes: 1
