PabloG

Reputation: 454

How to perform a union of sets from different rows of the same column in a DataFrame

What is the best (fastest) way to perform a union of sets that sit in different rows of the same column of a DataFrame?

For example, given the following DataFrame:

df_input=pd.DataFrame([[1,{1,2,3}],[1,{11,12}],[2,{1111,2222}],[2,{0,99}]], columns=['name', 'set'])

    name          set
0      1     {1, 2, 3}
1      1      {11, 12}
2      2  {2222, 1111}
3      2       {0, 99}

I would like to get:

    name                  set
0      1    {1, 2, 3, 11, 12}
1      2  {0, 99, 2222, 1111}

And if I have two columns with different sets, how can I combine both columns in the same way?

For example, for this dataframe:

df_input=pd.DataFrame([[1,{1,2,3},{'a','b'}],[1,{11,12},{'j'}],[2,{1111,2222},{'m','n'}],[2,{0,99},{'p'}]], columns=['name', 'set1', 'set2'])

   name          set1    set2
0     1     {1, 2, 3}  {b, a}
1     1      {11, 12}     {j}
2     2  {2222, 1111}  {m, n}
3     2       {0, 99}     {p}

I am looking for a way to get this as output:

   name                 set1       set2
0     1    {1, 2, 3, 11, 12}  {b, j, a}
1     2  {0, 99, 2222, 1111}  {m, p, n}

Thank you.

Upvotes: 2

Views: 269

Answers (1)

Savir

Reputation: 18428

I am really not very knowledgeable in Pandas, and I'm sure there's a better way. If you have time, you should probably wait for a better answer, but something like this seems to do the trick:

import pandas as pd
df_input=pd.DataFrame([[1,{1,2,3},{'a','b'}],[1,{11,12},{'j'}],[2,{1111,2222},{'m','n'}],[2,{0,99},{'p'}]], columns=['name', 'set1', 'set2'])

rows = []
for name, agg_df in df_input.groupby('name'):
    data = {
        'name': name,
        'set1': set(),
        'set2': set(),
    }
    # Fold every set of this group into the accumulator for its column
    for s in agg_df['set1']:
        data['set1'].update(s)
    for s in agg_df['set2']:
        data['set2'].update(s)
    rows.append(data)

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
new = pd.DataFrame(rows)

print(new.head())

prints (the order of the elements inside each set may vary):

   name                 set1       set2
0     1    {1, 2, 3, 11, 12}  {b, j, a}
1     2  {0, 99, 2222, 1111}  {p, n, m}

You can certainly add more Python syntactic sugar, although that's not really pandas-specific:

import pandas as pd
df_input=pd.DataFrame([[1,{1,2,3},{'a','b'}],[1,{11,12},{'j'}],[2,{1111,2222},{'m','n'}],[2,{0,99},{'p'}]], columns=['name', 'set1', 'set2'])

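SET_COLUMNS = ('set1', 'set2')
rows = []
for name, agg_df in df_input.groupby('name'):
    # One empty accumulator set per set column, plus the group key
    data = {'name': name, **{set_col: set() for set_col in SET_COLUMNS}}
    for set_col in SET_COLUMNS:
        # set.update accepts several iterables, so unpack all sets of the group at once
        data[set_col].update(*agg_df[set_col])
    rows.append(data)

# DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
new = pd.DataFrame(rows)

print(new.head())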
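
For what it's worth, the explicit loop over groups can probably be handed to pandas itself with groupby(...).agg(...). The following is only a minimal sketch under that assumption, not a benchmarked "fastest" solution: union_sets and result are names I made up for illustration, and the callable simply merges all the sets of one group with set().union(*...).

```python
import pandas as pd

df_input = pd.DataFrame(
    [[1, {1, 2, 3}, {'a', 'b'}],
     [1, {11, 12}, {'j'}],
     [2, {1111, 2222}, {'m', 'n'}],
     [2, {0, 99}, {'p'}]],
    columns=['name', 'set1', 'set2'],
)


def union_sets(series):
    """Hypothetical helper: merge every set in one group's column into a single set."""
    return set().union(*series)


# Aggregate each set column per 'name'; reset_index turns 'name' back into a regular column
result = (
    df_input.groupby('name')[['set1', 'set2']]
    .agg(union_sets)
    .reset_index()
)

print(result)
```

For the single-column case from the question, the same callable should work on one column, e.g. df_input.groupby('name')['set1'].agg(union_sets).reset_index(). The sets are still merged in plain Python under the hood, so I would not expect a dramatic speed-up over the loop; it is mostly shorter.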

Upvotes: 1
