Anirudh Jayakumar
Anirudh Jayakumar

Reputation: 1224

grouping in pandas data frame

I have a pandas data frame as below. I want to get the list of 'Job_No' for all the combinations of ('User_ID', 'Exec_No')

   User_ID Exec_No Job_No
1:    2      1      1   
2:    2      2      2 
3:    3      2      3
4:    1      2      4
5:    1      1      5
6:    3      2      6
7:    2      2      7
8:    1      1      8

The desired output is another data frame that looks like

  User_ID Exec_No Job_No
1:    2      1      [1]   
2:    2      2      [2,7] 
3:    3      2      [3,6]
4:    1      2      [4]
5:    1      1      [5,8]

How do I do this using a few lines of code?

Also, the data frame is expected to have around a million rows. Therefore the performance is also important.

Upvotes: 0

Views: 2505

Answers (4)

Danial Tz
Danial Tz

Reputation: 1984

How about this way:

df = pd.DataFrame({'User_ID' : [2,2, 3, 1, 1, 3, 2, 1], 'Exec_No': [1, 2, 2, 2, 1, 2, 2, 1], 'Job_No':[1,2,3,4,5,6,7,8]}, columns=['User_ID', 'Exec_No','Job_No'])

df
User_ID Exec_No Job_No
0    2   1   1
1    2   2   2
2    3   2   3
3    1   2   4
4    1   1   5
5    3   2   6
6    2   2   7
7    1   1   8

Let's do the group by:

df2 = df.groupby(['User_ID', 'Exec_No'], sort=False).apply(lambda x: list(x['Job_No']))
df2    
User_ID  Exec_No
2        1             [1]
         2          [2, 7]
3        2          [3, 6]
1        1          [5, 8]
         2             [4]

and put the way you wanted it:

df2.reset_index()

User_ID Exec_No 0
0    2   1   [1]
1    2   2   [2, 7]
2    3   2   [3, 6]
3    1   1   [5, 8]
4    1   2   [4]

Upvotes: 0

Wouter Overmeire
Wouter Overmeire

Reputation: 69096

As a note, if you care about performance, storing lists in a DataFrame is not very efficient. After grouping the data, Job_No values can be accessed immediately, no need to create a new DataFrame (memory !) holding lists of Job_No per (User_Id, Exec_No) pair.

In [21]: df
Out[21]:
   User_ID  Exec_No  Job_No
0        2        1       1
1        2        2       2
2        3        2       3
3        1        2       4
4        1        1       5
5        3        2       6
6        2        2       7
7        1        1       8

In [22]: grouped = df.groupby(['User_ID', 'Exec_No'])

In [23]: grouped.get_group((3, 2))
Out[23]:
   User_ID  Exec_No  Job_No
2        3        2       3
5        3        2       6

In [24]: grouped.get_group((3, 2))['Job_No']
Out[24]:
2    3
5    6
Name: Job_No, dtype: int64

In [25]: list(grouped.get_group((3, 2))['Job_No'])
Out[25]: [3, 6]

Upvotes: 2

Rutger Kassies
Rutger Kassies

Reputation: 64443

This will give a Series in return:

df.groupby(['User_ID', 'Exec_No']).apply(lambda x: x.Job_No.values)

Wrapping it in a Series in the apply returns a DataFrame:

df.groupby(['User_ID', 'Exec_No']).apply(lambda x: pd.Series([x.Job_No.values]))

User_ID Exec_No        
1       1        [5, 8]
        2           [4]
2       1           [1]
        2        [2, 7]
3       2        [3, 6]

It would be nice if the name= of the Series would be used as the resulting column name, but it isnt.

Upvotes: 0

Anirudh Jayakumar
Anirudh Jayakumar

Reputation: 1224

The solution is straight forward.

say if 'df' is the dataframe object, then

grp_df = df.groupby(['User_ID','Exec_No'])
newdf  = grp_df['Job_No']

Upvotes: 0

Related Questions