Barry
Barry

Reputation: 341

Groupby and only selected Columns

Here i Read a file "userdata.xlsx":

ID  Debt    Email   Age User
1   7.5 [email protected]  16  John
2   15  [email protected]  15  John
3   22  [email protected]  15  John
4   30  [email protected] 22  David
5   33  [email protected] 22  David
6   51  [email protected]  61  Fred
7   11  [email protected]  25  Fred
8   24  [email protected]  19  Eric
9   68  [email protected] 55  Terry
10  335 [email protected] 55  Terry

Here I group by User and create a spreadsheet per user and output it as its own .xlsx file like this:

ID  Debt    Email   Age User
1   7.5 [email protected]  16  John
2   15  [email protected]  15  John

Here is the entire code:

    #!/usr/bin/env python3

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import xlrd

    df = pd.read_excel('userdata.xlsx')
    grp = df.groupby('User')

    for group in grp.groups:
        grouptofile = (grp.get_group(group))
        print(grouptofile)
        print(group)
        grouptofile.to_excel('%s.xlsx' % group , sheet_name='sheet1', index=False)

Now I only want to save selected columns to save for each user. Lets say I only want the columns "ID" and "Email" selected. I learned how to select only certain columns like this:

selected = df[['ID','Email']]

I now thought it would make sense to add the ID and Email here.

grp = df.groupby('User')

Added the "ID" and "Email"

grp = df[['ID', 'Email']].groupby('User')

Is it even possible to combine groupby and select columns?

#!/usr/bin/env python3

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import xlrd

    df = pd.read_excel('userdata.xlsx')
    grp = df[['ID', 'Email']].groupby('User')

    for group in grp.groups:
        grouptofile = (grp.get_group(group))
        print(grouptofile)
        print(group)
        grouptofile.to_excel('%s.xlsx' % group , sheet_name='sheet1', index=False)

This is the Error im getting:

Traceback (most recent call last):
  File "/Users/Barry/Documents/Python/Pandas/exelscript.py", line 22, in <module>
    grp = df[['ID', 'Email']].groupby('User')
  File "/usr/local/lib/python3.7/site-packages/pandas/core/generic.py", line 6665, in groupby
    observed=observed, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 2152, in groupby
    return klass(obj, by, **kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 599, in __init__
    mutated=self.mutated)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 3291, in _get_grouper
    raise KeyError(gpr)
KeyError: 'User'

Upvotes: 2

Views: 282

Answers (2)

jezrael
jezrael

Reputation: 862691

I think you need specify columns in subset:

cols = ['ID', 'Email']
for i, group in df.groupby('User'):
    group[cols].to_excel('{}.xlsx'.format(i), sheet_name='sheet1', index=False)

If get KeyError: 'User' it means you want select column which not exist.

So if select columns ID and Email, then chained groupby cannot find User column and raise error:

print (df[['ID', 'Email']])
   ID            Email
0   1   [email protected]
1   2   [email protected]
2   3   [email protected]
3   4  [email protected]
4   5  [email protected]
5   6   [email protected]
6   7   [email protected]
7   8   [email protected]
8   9  [email protected]
9  10  [email protected]

So is necessary select columns also use in groupby:

for i, group in df[['ID', 'Email', 'User']].groupby('User'):
    group.to_excel('{}.xlsx'.format(i), sheet_name='sheet1', index=False)

Or select columns before writing to file like in first solution.

for i, group in df[['ID', 'Email', 'User']].groupby('User'):
    group[cols].to_excel('{}.xlsx'.format(i), sheet_name='sheet1', index=False)

Upvotes: 1

wizzwizz4
wizzwizz4

Reputation: 6426

It's possible... but not quite how you're doing it.

You are effectively removing all of the columns but two, and then trying to group by a third column that doesn't exist any more. Instead, you'll need to group before you select the columns (though I don't know whether grouping in numpy is a mutating operation, so you might need to make a copy first).

(Probably suboptimal) example:

grp = df[('ID', 'Email', 'User')].groupby('User')[('ID', 'Email')]

Upvotes: 0

Related Questions