Mo Kamyab

Reputation: 27

Manipulating groupby object

I have searched through the other questions but none of them addressed this problem. The focus of this issue is to manipulate the groups directly.

Let's assume I have the following data frame:

     A  B  C    Bg
0    1  X  1  None
1    2  A  7  None
2    3  X  9     1
3    4  X  1     1
4    5  B  1  None
5    6  X  0  None
6    7  C  8  None
7    8  A  5  None
8    9  X  9     2
9   10  X  4     2
10  11  X  2     2
11  12  A  4  None
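
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame. It assumes `Bg` is an object column holding `None` and the integers 1 and 2, which is how the printout above reads; the original construction code is not shown, so treat the dtypes as a guess.

```python
import pandas as pd

# Assumed reconstruction of the frame shown above.
# dtype=object keeps None and the integer keys as-is instead of
# converting the column to float with NaN.
df2 = pd.DataFrame({
    "A": range(1, 13),
    "B": list("XAXXBXCAXXXA"),
    "C": [1, 7, 9, 1, 1, 0, 8, 5, 9, 4, 2, 4],
    "Bg": pd.Series(
        [None, None, 1, 1, None, None, None, None, 2, 2, 2, None],
        dtype=object,
    ),
})

print(df2)
```

Rows where `Bg` is `None` are treated as missing and are dropped by `groupby` by default, which is why only groups 1 and 2 appear.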

It is then grouped by the 'Bg' column:

groups = df2.groupby('Bg')
for name, group in groups:
    print('name:', name, '\n', group, '\n\n')

The groups look like this:

name: 1 
    A  B  C Bg
2  3  X  9  1
3  4  X  1  1 


name: 2 
      A  B  C Bg
8    9  X  9  2
9   10  X  4  2
10  11  X  2  2 

I wrote the following code to perform some tasks and manipulate the groups:

import copy
import numpy as np

groups3 = copy.deepcopy(groups)
for name, group in groups3:
    idx_first = group.index[0]
    idx_last = group.index[-1]
    if name == 2:      
        groups3.groups[name] = np.delete(groups3.groups[name], range(0, 1), axis=0)
    else:
        del groups3.groups[name]
print('groups', groups3.groups)

print('-------')
for name, group in groups3:
    print(group)

and the output is:

groups {2: Int64Index([9, 10], dtype='int64')}
-------
   A  B  C Bg
2  3  X  9  1
3  4  X  1  1
     A  B  C Bg
8    9  X  9  2
9   10  X  4  2
10  11  X  2  2

However, I'm expecting this in the output:

groups {2: Int64Index([9, 10], dtype='int64')}
-------
     A  B  C Bg
9   10  X  4  2
10  11  X  2  2

Upvotes: 2

Views: 881

Answers (1)

piRSquared

Reputation: 294546

This is a seriously messy rabbit hole...


Short Story
Iterating over a groupby object isn't controlled by the dictionary returned by its groups attribute, so editing that dictionary doesn't change what the loop yields.
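
A quick way to check this claim is to delete a key from the mapping and then iterate. This is only a sketch: it uses a cut-down version of the question's frame (assuming integer keys 1 and 2 in an object column), and `mapping` and `names_seen` are names made up for the example.

```python
import pandas as pd

# Cut-down version of the question's frame; dtypes are assumed.
df2 = pd.DataFrame({
    "A": range(1, 13),
    "Bg": pd.Series(
        [None, None, 1, 1, None, None, None, None, 2, 2, 2, None],
        dtype=object,
    ),
})
groups = df2.groupby("Bg")

# Remove group 1 from the label -> index mapping only.
mapping = groups.groups
del mapping[1]

# Iterating the groupby still yields both groups, because the loop
# never consults the mapping that was just edited.
names_seen = [name for name, _ in groups]
print("mapping keys:", list(mapping))
print("names seen while iterating:", names_seen)
```

Group 1 is gone from the mapping but still shows up in the loop, which matches the output in the question.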

Iteration starts with def __iter__:

def __iter__(self):
    """
    Groupby iterator

    Returns
    -------
    Generator yielding sequence of (name, subsetted object)
    for each group
    """
    return self.grouper.get_iterator(self.obj, axis=self.axis)

Then on to def get_iterator:

def get_iterator(self, data, axis=0):
    """
    Groupby iterator

    Returns
    -------
    Generator yielding sequence of (name, subsetted object)
    for each group
    """
    splitter = self._get_splitter(data, axis=axis)
    keys = self._get_group_keys()
    for key, (i, group) in zip(keys, splitter):
        yield key, group

That in turn references _get_splitter and _get_group_keys.

Both of these rely on group_info, which returns an obscure and well-protected tuple of things that control the iteration. I couldn't figure out how to completely control the iteration, but I could mess it up.

a, b, c = groups3.grouper.group_info
a[a==1] = -1

for name, group in groups3:
    print(group)

   A  B  C Bg
2  3  X  9  1
3  4  X  1  1
Empty DataFrame
Columns: [A, B, C, Bg]
Index: []

My advice... Don't Do This!

Option 1
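Filter, then groupby again. The comparison has to match the type of the group keys: if the keys are integers, as name == 2 in the question suggests, compare against 2 rather than '2' here and in Option 2.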

df2.groupby('Bg').filter(lambda x: x.name != '2').groupby('Bg')

Option 2
dictionary comprehension

{name: group for name, group in groups3 if name != '2'}
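
Applied to the output the question is asking for (only group 2, without its first row), the two options could look roughly like the sketch below. It assumes the group keys are the integers 1 and 2 in an object column, as in the question's `if name == 2` check; `trimmed` and `regrouped` are names made up for the example.

```python
import pandas as pd

# Assumed reconstruction of the question's frame.
df2 = pd.DataFrame({
    "A": range(1, 13),
    "B": list("XAXXBXCAXXXA"),
    "C": [1, 7, 9, 1, 1, 0, 8, 5, 9, 4, 2, 4],
    "Bg": pd.Series(
        [None, None, 1, 1, None, None, None, None, 2, 2, 2, None],
        dtype=object,
    ),
})

# Option 2 style: build a new dict, keeping only group 2
# and dropping its first row.
trimmed = {
    name: group.iloc[1:]
    for name, group in df2.groupby("Bg")
    if name == 2
}

# Option 1 style: select and trim the rows first, then group again,
# so the result is a real groupby object.
regrouped = df2[df2["Bg"] == 2].iloc[1:].groupby("Bg")

for name, group in regrouped:
    print("name:", name)
    print(group)
```

Either way the original groupby object is left alone and a new object is built from the rows that should survive.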

Upvotes: 3
