Reputation: 454
What is the best (fastest) way to compute the union of sets from different rows in the same column of a DataFrame?
For example for the following dataframe:
df_input=pd.DataFrame([[1,{1,2,3}],[1,{11,12}],[2,{1111,2222}],[2,{0,99}]], columns=['name', 'set'])
name set
0 1 {1, 2, 3}
1 1 {11, 12}
2 2 {2222, 1111}
3 2 {0, 99}
I would like to get:
name set
0 1 {1, 2, 3, 11, 12}
1 2 {0, 99, 2222, 1111}
And if I have two columns with different sets, how can I combine both columns in the same way?
For example, for this dataframe:
df_input=pd.DataFrame([[1,{1,2,3},{'a','b'}],[1,{11,12},{'j'}],[2,{1111,2222},{'m','n'}],[2,{0,99},{'p'}]], columns=['name', 'set1', 'set2'])
name set1 set2
0 1 {1, 2, 3} {b, a}
1 1 {11, 12} {j}
2 2 {2222, 1111} {m, n}
3 2 {0, 99} {p}
I am looking for a way to get this as output:
name set1 set2
0 1 {1, 2, 3, 11, 12} {b, j, a}
1 2 {0, 99, 2222, 1111} {m, p, n}
Thank you.
Upvotes: 2
Views: 269
Reputation: 18428
I am not very knowledgeable in pandas, and I'm sure there's a better way (if you have time, you should probably wait for a better answer), but something like this seems to do the trick:
import pandas as pd

df_input = pd.DataFrame([[1,{1,2,3},{'a','b'}],[1,{11,12},{'j'}],[2,{1111,2222},{'m','n'}],[2,{0,99},{'p'}]], columns=['name', 'set1', 'set2'])

# DataFrame.append was removed in pandas 2.0, so collect one dict per group
# and build the result frame once at the end.
rows = []
for name, agg_df in df_input.groupby('name'):
    data = {
        'name': name,
        'set1': set(),
        'set2': set(),
    }
    # Merge every set in the group into the accumulator for its column.
    for s in agg_df['set1']:
        data['set1'].update(s)
    for s in agg_df['set2']:
        data['set2'].update(s)
    rows.append(data)
new = pd.DataFrame(rows)
print(new.head())
prints (the order of elements inside each set may vary):
name set1 set2
0 1 {1, 2, 3, 11, 12} {b, j, a}
1 2 {0, 99, 2222, 1111} {p, n, m}
You can certainly add more Python syntactic sugar so the set columns are not hard-coded, but that's plain Python rather than pandas:
import pandas as pd

df_input = pd.DataFrame([[1,{1,2,3},{'a','b'}],[1,{11,12},{'j'}],[2,{1111,2222},{'m','n'}],[2,{0,99},{'p'}]], columns=['name', 'set1', 'set2'])
SET_COLUMNS = ('set1', 'set2')

# Collect one dict per group; DataFrame.append no longer exists in pandas 2.0+.
rows = []
for name, agg_df in df_input.groupby('name'):
    # Start each group with an empty accumulator set per set column.
    data = {'name': name, **{set_col: set() for set_col in SET_COLUMNS}}
    for set_col in SET_COLUMNS:
        for s in agg_df[set_col]:
            data[set_col].update(s)
    rows.append(data)
new = pd.DataFrame(rows)
print(new.head())
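For what it's worth, the explicit loop can probably be pushed into pandas itself by handing a small reducer to groupby(...).agg. The following is only a sketch under that assumption: the helper name union_sets and the variables result and single are made up for illustration, and I have not measured whether it is actually faster on large frames.

```python
import pandas as pd

df_input = pd.DataFrame(
    [[1, {1, 2, 3}, {'a', 'b'}],
     [1, {11, 12}, {'j'}],
     [2, {1111, 2222}, {'m', 'n'}],
     [2, {0, 99}, {'p'}]],
    columns=['name', 'set1', 'set2'],
)


def union_sets(group):
    """Hypothetical helper: union of all sets in one group's column.

    Starting from an empty set leaves the original sets untouched.
    """
    return set().union(*group)


# Multi-column case: every non-key column is reduced with union_sets.
result = df_input.groupby('name').agg(union_sets).reset_index()

# Single-column case from the first half of the question.
single = df_input.groupby('name')['set1'].agg(union_sets).reset_index()

print(result)
```

Because agg applies the same reducer to each remaining column, this version does not need the set column names spelled out; a column that does not hold sets would have to be excluded or aggregated separately.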
Upvotes: 1