Reputation: 1224
I have a pandas data frame as below. I want to get the list of 'Job_No' for all the combinations of ('User_ID', 'Exec_No')
User_ID Exec_No Job_No
1: 2 1 1
2: 2 2 2
3: 3 2 3
4: 1 2 4
5: 1 1 5
6: 3 2 6
7: 2 2 7
8: 1 1 8
The desired output is another data frame that looks like
User_ID Exec_No Job_No
1: 2 1 [1]
2: 2 2 [2,7]
3: 3 2 [3,6]
4: 1 2 [4]
5: 1 1 [5,8]
How do I do this using a few lines of code?
Also, the data frame is expected to have around a million rows. Therefore the performance is also important.
Upvotes: 0
Views: 2505
Reputation: 1984
How about this way:
df = pd.DataFrame({'User_ID' : [2,2, 3, 1, 1, 3, 2, 1], 'Exec_No': [1, 2, 2, 2, 1, 2, 2, 1], 'Job_No':[1,2,3,4,5,6,7,8]}, columns=['User_ID', 'Exec_No','Job_No'])
df
User_ID Exec_No Job_No
0 2 1 1
1 2 2 2
2 3 2 3
3 1 2 4
4 1 1 5
5 3 2 6
6 2 2 7
7 1 1 8
Let's do the group by:
df2 = df.groupby(['User_ID', 'Exec_No'], sort=False).apply(lambda x: list(x['Job_No']))
df2
User_ID Exec_No
2 1 [1]
2 [2, 7]
3 2 [3, 6]
1 1 [5, 8]
2 [4]
and put the way you wanted it:
df2.reset_index()
User_ID Exec_No 0
0 2 1 [1]
1 2 2 [2, 7]
2 3 2 [3, 6]
3 1 1 [5, 8]
4 1 2 [4]
Upvotes: 0
Reputation: 69096
As a note, if you care about performance, storing lists in a DataFrame is not very efficient. After grouping the data, Job_No values can be accessed immediately, no need to create a new DataFrame (memory !) holding lists of Job_No per (User_Id, Exec_No) pair.
In [21]: df
Out[21]:
User_ID Exec_No Job_No
0 2 1 1
1 2 2 2
2 3 2 3
3 1 2 4
4 1 1 5
5 3 2 6
6 2 2 7
7 1 1 8
In [22]: grouped = df.groupby(['User_ID', 'Exec_No'])
In [23]: grouped.get_group((3, 2))
Out[23]:
User_ID Exec_No Job_No
2 3 2 3
5 3 2 6
In [24]: grouped.get_group((3, 2))['Job_No']
Out[24]:
2 3
5 6
Name: Job_No, dtype: int64
In [25]: list(grouped.get_group((3, 2))['Job_No'])
Out[25]: [3, 6]
Upvotes: 2
Reputation: 64443
This will give a Series in return:
df.groupby(['User_ID', 'Exec_No']).apply(lambda x: x.Job_No.values)
Wrapping it in a Series in the apply returns a DataFrame:
df.groupby(['User_ID', 'Exec_No']).apply(lambda x: pd.Series([x.Job_No.values]))
User_ID Exec_No
1 1 [5, 8]
2 [4]
2 1 [1]
2 [2, 7]
3 2 [3, 6]
It would be nice if the name=
of the Series would be used as the resulting column name, but it isnt.
Upvotes: 0
Reputation: 1224
The solution is straight forward.
say if 'df' is the dataframe object, then
grp_df = df.groupby(['User_ID','Exec_No'])
newdf = grp_df['Job_No']
Upvotes: 0