Reputation: 15273
I have this pandas dataframe:
df
ID Value
0 1 0.33
1 1 0.91
2 1 0.28
3 2 0.36
4 2 0.50
5 3 0.47
6 3 0.98
7 3 0.34
8 3 0.37
I want to group by ID and create two new columns, values_in and values_out.
The output should look like this:
ID values_in values_out
0 1 [0.33, 0.91, 0.28] [0.36, 0.5, 0.47, 0.98, 0.34, 0.37]
1 2 [0.36, 0.5] [0.33, 0.91, 0.28, 0.47, 0.98, 0.34, 0.37]
2 3 [0.47, 0.98, 0.34, 0.37] [0.33, 0.91, 0.28, 0.36, 0.5]
How can I do that, knowing that a classic groupby will automatically exclude the values_out?
FYI: I do not care about the order of the values inside the lists.
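For reference, a minimal sketch of the "classic groupby" I mean, with the sample frame rebuilt from the display above:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Value': [0.33, 0.91, 0.28, 0.36, 0.50, 0.47, 0.98, 0.34, 0.37],
})

# This only collects each group's own values (values_in);
# the values from every other group (values_out) are lost.
values_in = df.groupby('ID')['Value'].apply(list).reset_index(name='values_in')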
Upvotes: 3
Views: 127
Reputation: 15273
I tried all the proposed solutions, but they are not efficient enough (the loops are too slow).
My solution is to cross join and then drop the pairs where both sides have the same ID.
import pandas as pd

df_in = df.groupby('ID')['Value'].apply(list).reset_index(name="values_in")
df_out = pd.merge(df, df, how="cross")
df_out = (df_out.loc[df_out["ID_x"] != df_out["ID_y"]]   # drop same-ID pairs
                .groupby('ID_x')['Value_y'].apply(list)
                .reset_index(name="values_out")
                .rename(columns={"ID_x": "ID"}))          # restore the ID name for the merge
final_df = pd.merge(df_in, df_out, on="ID", how="outer")
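One caveat (my own observation): the cross join materializes len(df) ** 2 rows before the filter runs, so memory grows quadratically with the input size. A quick way to see the intermediate size:
# The intermediate frame from pd.merge(df, df, how="cross") has len(df) ** 2 rows.
n = len(df)
print(n * n)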
Upvotes: 0
Reputation: 71687
Use groupby on ID and, inside a list comprehension, create a dictionary for each grouped frame with the keys ID, values_in and values_out:
import pandas as pd

d = [{'ID': k,
      'values_in': g['Value'].values,
      'values_out': df.loc[df['ID'].ne(k), 'Value'].values}
     for k, g in df.groupby('ID')]
df_ = pd.DataFrame(d)
Result:
ID values_in values_out
0 1 [0.33, 0.91, 0.28] [0.36, 0.5, 0.47, 0.98, 0.34, 0.37]
1 2 [0.36, 0.5] [0.33, 0.91, 0.28, 0.47, 0.98, 0.34, 0.37]
2 3 [0.47, 0.98, 0.34, 0.37] [0.33, 0.91, 0.28, 0.36, 0.5]
Timings (tested with a dataframe containing 100 unique ID values):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.arange(100).tolist() * 100, 'Value': np.random.randn(10000)})
df.shape
(10000, 2)
%%timeit -n10
d = [{'ID': k,
      'values_in': g['Value'].values,
      'values_out': df.loc[df['ID'].ne(k), 'Value'].values}
     for k, g in df.groupby('ID')]
df_ = pd.DataFrame(d)
71.5 ms ± 4.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n10
new_df = df.groupby("ID")["Value"].apply(list).reset_index(name="values_in")
new_df["values_out"] = new_df["values_in"] @ (1 - np.eye(new_df.shape[0], dtype=int))
204 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 2
Reputation: 863226
Use a custom function that builds values_out by filtering out the current group's ID with boolean indexing:
def f(x):
    return {'values_in': list(x),
            'values_out': list(df.loc[df['ID'] != x.name, 'Value'])}

df = df.groupby('ID')['Value'].apply(f).unstack()
print(df)
values_in values_out
ID
1 [0.33, 0.91, 0.28] [0.36, 0.5, 0.47, 0.98, 0.34, 0.37]
2 [0.36, 0.5] [0.33, 0.91, 0.28, 0.47, 0.98, 0.34, 0.37]
3 [0.47, 0.98, 0.34, 0.37] [0.33, 0.91, 0.28, 0.36, 0.5]
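Side note on why the unstack works: when the applied function returns a dict, pandas expands it into a Series keyed by a MultiIndex of (ID, dict key), and unstack then pivots the inner level into the two columns. A minimal sketch of the intermediate, continuing the snippet above but run against the original df before the reassignment (s and result are illustrative names):
s = df.groupby('ID')['Value'].apply(f)
# s is indexed by (ID, 'values_in'/'values_out') pairs, e.g.:
# 1  values_in     [0.33, 0.91, 0.28]
#    values_out    [0.36, 0.5, 0.47, 0.98, 0.34, 0.37]
result = s.unstack()   # pivot the inner index level into columns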
Upvotes: 0
Reputation: 354
I am using a different example with the same scenario to demonstrate how it could be done:
import pandas as pd

ids = [1, 1, 2, 2, 2, 3]
values = [12, 13, 14, 15, 12, 12]
df = pd.DataFrame({'Id': ids, 'values': values})
df
Output:
Id values
0 1 12
1 1 13
2 2 14
3 2 15
4 2 12
5 3 12
Now, you can fetch the values_in and values_out separately, as I have done:
rows = []                                            # collect one record per Id
for i in df["Id"].unique():                          # iterate through unique Ids
    in_list = list(df.loc[df['Id'] == i, 'values'])  # values_in
    out_list = list(df.loc[df['Id'] != i, 'values']) # values_out
    rows.append([i, in_list, out_list])
# DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
df2 = pd.DataFrame(rows, columns=["ID", "Values_in", "Values_out"])
Output:
ID Values_in Values_out
0 1 [12, 13] [14, 15, 12, 12]
1 2 [14, 15, 12] [12, 13, 12]
2 3 [12] [12, 13, 14, 15, 12]
If you don't want duplicate values, you can use set instead of list for values_in and values_out, as in the sketch below.
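For instance, a minimal sketch of that set-based variant on the same example data (rows, in_set and out_set are illustrative names):
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 2, 2, 2, 3],
                   'values': [12, 13, 14, 15, 12, 12]})

rows = []
for i in df["Id"].unique():
    in_set = set(df.loc[df['Id'] == i, 'values'])    # deduplicated values_in
    out_set = set(df.loc[df['Id'] != i, 'values'])   # deduplicated values_out
    rows.append([i, in_set, out_set])
df2 = pd.DataFrame(rows, columns=["ID", "Values_in", "Values_out"])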
Upvotes: 1
Reputation: 29742
One way using matmul (the @ operator):
import numpy as np

new_df = df.groupby("ID")["Value"].apply(list).reset_index(name="values_in")
# Multiplying the object Series by a 0/1 matrix concatenates every other row's list.
new_df["values_out"] = new_df["values_in"] @ (1 - np.eye(new_df.shape[0], dtype=int))
print(new_df)
Output:
ID values_in values_out
0 1 [0.33, 0.91, 0.28] [0.36, 0.5, 0.47, 0.98, 0.34, 0.37]
1 2 [0.36, 0.5] [0.33, 0.91, 0.28, 0.47, 0.98, 0.34, 0.37]
2 3 [0.47, 0.98, 0.34, 0.37] [0.33, 0.91, 0.28, 0.36, 0.5]
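Side note on why this works: values_in holds Python lists, and the matrix product multiplies each list by 0 (giving []) or 1 (giving the list itself) and concatenates the results with +, so row i collects every list except the i-th. A minimal pure-Python sketch of that arithmetic (illustrative data only, not the answer's code):
import numpy as np

values_in = [[0.33, 0.91], [0.36, 0.5], [0.47, 0.98]]  # one list per ID
mask = 1 - np.eye(3, dtype=int)                        # zeros on the diagonal, ones elsewhere

# list * 0 == [] and list * 1 == the list itself; sum(..., []) concatenates.
values_out = [sum((lst * m for lst, m in zip(values_in, row)), [])
              for row in mask]
# [[0.36, 0.5, 0.47, 0.98], [0.33, 0.91, 0.47, 0.98], [0.33, 0.91, 0.36, 0.5]]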
Upvotes: 4