Suppose I have a Python dict of lists like so:
di = {'Grp': ['2', '6', '6', '5', '5', '6', '6', '7', '7', '6'],
      'Nums': ['6.20', '6.30', '6.80', '6.45', '6.55', '6.35', '6.37', '6.36', '6.78', '6.33']}
I can easily group the numbers by their group key using itertools.groupby:
from itertools import groupby
for k, l in groupby(zip(di['Grp'], di['Nums']), key=lambda t: t[0]):
    print(k, [t[1] for t in l])
Prints:
2 ['6.20']
6 ['6.30', '6.80'] # one field, key=6
5 ['6.45', '6.55']
6 ['6.35', '6.37'] # second
7 ['6.36', '6.78']
6 ['6.33'] # third
Note the 6 key is separated into three separate groups, or fields.
Now suppose I have the equivalent Pandas DataFrame to my dict (same data, same list order and same keys):
Grp Nums
0 2 6.20
1 6 6.30
2 6 6.80
3 5 6.45
4 5 6.55
5 6 6.35
6 6 6.37
7 7 6.36
8 7 6.78
9 6 6.33
If I use Pandas' groupby, I don't see how to get this group-by-group iteration. Instead, Pandas groups by key value:
for e in df.groupby('Grp'):
    print(e)
Prints:
('2', Grp Nums
0 2 6.20)
('5', Grp Nums
3 5 6.45
4 5 6.55)
('6', Grp Nums
1 6 6.30
2 6 6.80 # rows 1-2: first field
5 6 6.35 # rows 5-6: second field
6 6 6.37
9 6 6.33) # row 9: third field
('7', Grp Nums
7 7 6.36
8 7 6.78)
Note that all the 6 group keys are bunched together, not in separate groups.
My question: is there an equivalent way to use Pandas' groupby so that the 6 key, for example, would fall into three groups, in the same fashion as Python's groupby?
I tried this:
>>> df.reset_index().groupby('Grp')['index'].apply(lambda x: np.array(x))
Grp
2 [0]
5 [3, 4]
6 [1, 2, 5, 6, 9] # I *could* do a second groupby on this...
7 [7, 8]
Name: index, dtype: object
But it is still grouped by the overall Grp key, and I would need to do a second groupby on each ndarray to split out the sub-groups of each key.
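For the record, that second pass can be sketched in plain Python: within one key's index array, runs of consecutive positional indices correspond to rows that were adjacent in the DataFrame, so the classic value-minus-position trick splits them. Here idx is just the array shown above for the 6 key:
from itertools import groupby
idx = [1, 2, 5, 6, 9]  # positional indices Pandas collected for key '6'
# consecutive integers share the same (value - position) difference,
# so grouping on that difference recovers the original runs
for _, run in groupby(enumerate(idx), key=lambda t: t[1] - t[0]):
    print([i for _, i in run])
# [1, 2]
# [5, 6]
# [9]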
Upvotes: 17
Views: 4997
Reputation: 104024
Well, not to be cheeky, but why not just use Python's groupby on the DataFrame by using iterrows? That is what it is there for:
>>> df
Grp Nums
0 2 6.20
1 6 6.30
2 6 6.80
3 5 6.45
4 5 6.55
5 6 6.35
6 6 6.37
7 7 6.36
8 7 6.78
9 6 6.33
>>> from itertools import groupby
>>> for k, l in groupby(df.iterrows(), key=lambda row: row[1]['Grp']):
...     print(k, [t[1]['Nums'] for t in l])
Prints:
2 ['6.20']
6 ['6.30', '6.80']
5 ['6.45', '6.55']
6 ['6.35', '6.37']
7 ['6.36', '6.78']
6 ['6.33']
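If speed matters, the same approach also works with itertuples, which is usually faster than iterrows; here is a minimal sketch over the same df:
from itertools import groupby
# itertuples() yields namedtuples, so the key function can use attribute access
for k, rows in groupby(df.itertuples(index=False), key=lambda row: row.Grp):
    print(k, [row.Nums for row in rows])
This prints the same six groups in the same order.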
Trying to make Pandas' groupby act in the way you want is probably asking for so many stacked methods that you won't be able to follow the code when you reread it in the future.
Upvotes: 6
Reputation: 109626
You basically want to create a new column to index your desired grouping order, and then use that for grouping. You keep the index number the same until the value in Grp changes.
For your data, you would want something like this:
Grp Nums new_group
0 2 6.20 1
1 6 6.30 2
2 6 6.80 2
3 5 6.45 3
4 5 6.55 3
5 6 6.35 4
6 6 6.37 4
7 7 6.36 5
8 7 6.78 5
9 6 6.33 6
You can now group on both new_group and Grp:
df.groupby(['new_group', 'Grp']).Nums.groups
{(1, 2): [0],
(2, 6): [1, 2],
(3, 5): [3, 4],
(4, 6): [5, 6],
(5, 7): [7, 8],
(6, 6): [9]}
I used this method to create the new column:
df['new_group'] = None
for n, grp in enumerate(df.Grp):
    if n == 0:                      # first row starts group 1
        df.new_group.iat[0] = 1
    elif grp == df.Grp.iat[n - 1]:  # same key as previous row: same group
        df.new_group.iat[n] = df.new_group.iat[n - 1]
    else:                           # key changed: start a new group
        df.new_group.iat[n] = df.new_group.iat[n - 1] + 1
Note that this answer here has the same idea (thanks @ajcr for the link), but in a much more succinct representation:
>>> df.groupby((df.Grp != df.Grp.shift()).cumsum()).Nums.groups
{1: [0], 2: [1, 2], 3: [3, 4], 4: [5, 6], 5: [7, 8], 6: [9]}
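Either way, grouping on both the run counter and Grp (as in the new_group example above) means each group key is a tuple that still carries the original Grp value; a minimal sketch:
for (run, key), chunk in df.groupby(['new_group', 'Grp']):
    print(run, key, list(chunk.Nums))
# 1 2 ['6.20']
# 2 6 ['6.30', '6.80']
# 3 5 ['6.45', '6.55']
# 4 6 ['6.35', '6.37']
# 5 7 ['6.36', '6.78']
# 6 6 ['6.33']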
Upvotes: 2
Reputation: 8906
First you can identify which elements in the Grp column differ from the previous one, and take the cumulative sum to form the groups you need:
In [9]:
diff_to_previous = df.Grp != df.Grp.shift(1)
diff_to_previous.cumsum()
Out[9]:
0 1
1 2
2 2
3 3
4 3
5 4
6 4
7 5
8 5
9 6
So you can then do df.groupby(diff_to_previous.cumsum()) to get the desired groupby object.
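For completeness, a small sketch of iterating that grouped object run by run; the group key here is just the run counter, so the original Grp value is read back off each chunk:
diff_to_previous = df.Grp != df.Grp.shift(1)
for _, chunk in df.groupby(diff_to_previous.cumsum()):
    # each chunk is one contiguous run of rows sharing a single Grp value
    print(chunk.Grp.iat[0], list(chunk.Nums))
# 2 ['6.20']
# 6 ['6.30', '6.80']
# 5 ['6.45', '6.55']
# 6 ['6.35', '6.37']
# 7 ['6.36', '6.78']
# 6 ['6.33']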
Upvotes: 22