Reputation: 15273
I have this pandas dataframe:
df
ID Value
0 1 0.33
1 1 0.91
2 1 0.28
3 2 0.36
4 2 0.50
5 3 0.47
6 3 0.98
7 3 0.34
8 3 0.37
I want to group by ID and create two new columns, values_in and values_out.
The output should look like this:
ID values_in values_out
0 1 [0.33, 0.91, 0.28] [0.36, 0.5, 0.47, 0.98, 0.34, 0.37]
1 2 [0.36, 0.5] [0.33, 0.91, 0.28, 0.47, 0.98, 0.34, 0.37]
2 3 [0.47, 0.98, 0.34, 0.37] [0.33, 0.91, 0.28, 0.36, 0.5]
How can I do that, knowing that a classic groupby will automatically exclude the values_out?
FYI: I do not care about the order of the values inside the lists.
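For reference, a minimal sketch of the "classic groupby" I mean, with the sample frame rebuilt from the display above:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Value': [0.33, 0.91, 0.28, 0.36, 0.50, 0.47, 0.98, 0.34, 0.37],
})

# This only collects each group's own values (values_in);
# the values from every other group (values_out) are lost.
values_in = df.groupby('ID')['Value'].apply(list).reset_index(name='values_in')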
Upvotes: 3
Views: 127
Reputation: 15273
I tried all the proposed solutions, but they are not efficient enough (the loops are too slow).
My solution is to cross join and then drop the pairs where both sides have the same ID.
import pandas as pd

df_in = df.groupby('ID')['Value'].apply(list).reset_index(name="values_in")
df_out = pd.merge(df, df, how="cross")
df_out = (df_out.loc[df_out["ID_x"] != df_out["ID_y"]]   # drop same-ID pairs
                .groupby('ID_x')['Value_y'].apply(list)
                .reset_index(name="values_out")
                .rename(columns={"ID_x": "ID"}))          # restore the ID name for the merge
final_df = pd.merge(df_in, df_out, on="ID", how="outer")
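One caveat (my own observation): the cross join materializes len(df) ** 2 rows before the filter runs, so memory grows quadratically with the input size. A quick way to see the intermediate size:
# The intermediate frame from pd.merge(df, df, how="cross") has len(df) ** 2 rows.
n = len(df)
print(n * n)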
Upvotes: 0
Reputation: 71687
Use groupby on ID and, inside a list comprehension, create a dictionary for each grouped frame with the keys ID, values_in and values_out:
import pandas as pd

d = [{'ID': k,
      'values_in': g['Value'].values,
      'values_out': df.loc[df['ID'].ne(k), 'Value'].values}
     for k, g in df.groupby('ID')]
df_ = pd.DataFrame(d)
Result:
ID values_in values_out
0 1 [0.33, 0.91, 0.28] [0.36, 0.5, 0.47, 0.98, 0.34, 0.37]
1 2 [0.36, 0.5] [0.33, 0.91, 0.28, 0.47, 0.98, 0.34, 0.37]
2 3 [0.47, 0.98, 0.34, 0.37] [0.33, 0.91, 0.28, 0.36, 0.5]
Timings (tested with a dataframe containing 100 unique ID values):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.arange(100).tolist() * 100, 'Value': np.random.randn(10000)})
df.shape
(10000, 2)
%%timeit -n10
d = [{'ID': k,
      'values_in': g['Value'].values,
      'values_out': df.loc[df['ID'].ne(k), 'Value'].values}
     for k, g in df.groupby('ID')]
df_ = pd.DataFrame(d)
71.5 ms ± 4.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n10
new_df = df.groupby("ID")["Value"].apply(list).reset_index(name="values_in")
new_df["values_out"] = new_df["values_in"] @ (1 - np.eye(new_df.shape[0], dtype=int))
204 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 2
Reputation: 863226
Use a custom function that builds values_out by filtering out the current group's ID with boolean indexing:
def f(x):
    return {'values_in': list(x),
            'values_out': list(df.loc[df['ID'] != x.name, 'Value'])}

df = df.groupby('ID')['Value'].apply(f).unstack()
print(df)
values_in values_out
ID
1 [0.33, 0.91, 0.28] [0.36, 0.5, 0.47, 0.98, 0.34, 0.37]
2 [0.36, 0.5] [0.33, 0.91, 0.28, 0.47, 0.98, 0.34, 0.37]
3 [0.47, 0.98, 0.34, 0.37] [0.33, 0.91, 0.28, 0.36, 0.5]
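Side note on why the unstack works: when the applied function returns a dict, pandas expands it into a Series keyed by a MultiIndex of (ID, dict key), and unstack then pivots the inner level into the two columns. A minimal sketch of the intermediate, continuing the snippet above but run against the original df before the reassignment (s and result are illustrative names):
s = df.groupby('ID')['Value'].apply(f)
# s is indexed by (ID, 'values_in'/'values_out') pairs, e.g.:
# 1  values_in     [0.33, 0.91, 0.28]
#    values_out    [0.36, 0.5, 0.47, 0.98, 0.34, 0.37]
result = s.unstack()   # pivot the inner index level into columns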
Upvotes: 0
Reputation: 354
I am using a different example with the same scenario to demonstrate how it could be done:
import pandas as pd

ids = [1, 1, 2, 2, 2, 3]
values = [12, 13, 14, 15, 12, 12]
df = pd.DataFrame({'Id': ids, 'values': values})
df
Output:
Id values
0 1 12
1 1 13
2 2 14
3 2 15
4 2 12
5 3 12
Now, you can fetch the values_in and values_out separately, as I have done:
rows = []                                            # collect one record per Id
for i in df["Id"].unique():                          # iterate through unique Ids
    in_list = list(df.loc[df['Id'] == i, 'values'])  # values_in
    out_list = list(df.loc[df['Id'] != i, 'values']) # values_out
    rows.append([i, in_list, out_list])
# DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
df2 = pd.DataFrame(rows, columns=["ID", "Values_in", "Values_out"])
Output:
ID Values_in Values_out
0 1 [12, 13] [14, 15, 12, 12]
1 2 [14, 15, 12] [12, 13, 12]
2 3 [12] [12, 13, 14, 15, 12]
If you don't want duplicate values, you can use set instead of list for values_in and values_out, as in the sketch below.
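For instance, a minimal sketch of that set-based variant on the same example data (rows, in_set and out_set are illustrative names):
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 2, 2, 2, 3],
                   'values': [12, 13, 14, 15, 12, 12]})

rows = []
for i in df["Id"].unique():
    in_set = set(df.loc[df['Id'] == i, 'values'])    # deduplicated values_in
    out_set = set(df.loc[df['Id'] != i, 'values'])   # deduplicated values_out
    rows.append([i, in_set, out_set])
df2 = pd.DataFrame(rows, columns=["ID", "Values_in", "Values_out"])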
Upvotes: 1
Reputation: 29742
One way using matmul (the @ operator):
import numpy as np

new_df = df.groupby("ID")["Value"].apply(list).reset_index(name="values_in")
# Multiplying the object Series by a 0/1 matrix concatenates every other row's list.
new_df["values_out"] = new_df["values_in"] @ (1 - np.eye(new_df.shape[0], dtype=int))
print(new_df)
Output:
ID values_in values_out
0 1 [0.33, 0.91, 0.28] [0.36, 0.5, 0.47, 0.98, 0.34, 0.37]
1 2 [0.36, 0.5] [0.33, 0.91, 0.28, 0.47, 0.98, 0.34, 0.37]
2 3 [0.47, 0.98, 0.34, 0.37] [0.33, 0.91, 0.28, 0.36, 0.5]
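Side note on why this works: values_in holds Python lists, and the matrix product multiplies each list by 0 (giving []) or 1 (giving the list itself) and concatenates the results with +, so row i collects every list except the i-th. A minimal pure-Python sketch of that arithmetic (illustrative data only, not the answer's code):
import numpy as np

values_in = [[0.33, 0.91], [0.36, 0.5], [0.47, 0.98]]  # one list per ID
mask = 1 - np.eye(3, dtype=int)                        # zeros on the diagonal, ones elsewhere

# list * 0 == [] and list * 1 == the list itself; sum(..., []) concatenates.
values_out = [sum((lst * m for lst, m in zip(values_in, row)), [])
              for row in mask]
# [[0.36, 0.5, 0.47, 0.98], [0.33, 0.91, 0.47, 0.98], [0.33, 0.91, 0.36, 0.5]]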
Upvotes: 4