Kevin

Reputation: 447

Intersection of sets as columns in pandas

I have a df such as:

import pandas as pd

# DataFrame.from_items was removed in pandas 1.0; a dict keeps the column order
df = pd.DataFrame({'i': [{1, 2, 3, 4} for _ in range(4)], 'j': [{2, 3}, {1}, {4}, {3, 4}]})

so it looks like

>>> df
              i       j
0  {1, 2, 3, 4}  {2, 3}
1  {1, 2, 3, 4}     {1}
2  {1, 2, 3, 4}     {4}
3  {1, 2, 3, 4}  {3, 4}

I would like to compute df.i.intersection(df.j) and assign that to be column k. That is, I want this:

df['k']=[df.i.iloc[t].intersection(df.j.iloc[t]) for t in range(4)]

>>> df.k
0    {2, 3}
1       {1}
2       {4}
3    {3, 4}
Name: k, dtype: object

Is there a df.apply() for this? The actual df is millions of rows.

Upvotes: 6

Views: 3829

Answers (2)

jezrael

Reputation: 862711

Working with sets, lists and dicts in pandas is a bit problematic, because pandas works best with scalars. A list comprehension over the zipped columns is one option:

df['k'] = [x[0] & x[1] for x in zip(df['i'], df['j'])]
print (df)
              i       j       k
0  {1, 2, 3, 4}  {2, 3}  {2, 3}
1  {1, 2, 3, 4}     {1}     {1}
2  {1, 2, 3, 4}     {4}     {4}
3  {1, 2, 3, 4}  {3, 4}  {3, 4}

df['k'] = [x[0].intersection(x[1]) for x in zip(df['i'], df['j'])]
print (df)
              i       j       k
0  {1, 2, 3, 4}  {2, 3}  {2, 3}
1  {1, 2, 3, 4}     {1}     {1}
2  {1, 2, 3, 4}     {4}     {4}
3  {1, 2, 3, 4}  {3, 4}  {3, 4}

Solution with apply:

df['k'] = df.apply(lambda x: x['i'].intersection(x['j']), axis=1)
print (df)
              i       j       k
0  {1, 2, 3, 4}  {2, 3}  {2, 3}
1  {1, 2, 3, 4}     {1}     {1}
2  {1, 2, 3, 4}     {4}     {4}
3  {1, 2, 3, 4}  {3, 4}  {3, 4}
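
Since the real frame has millions of rows, it may be worth timing both variants on your own data before choosing one. The following is only a rough sketch: the frame big, the size n and the helper names with_zip and with_apply are made up for illustration, and the timings will vary by machine and pandas version. Row-wise apply typically carries more per-row overhead than the list comprehension.

```python
import timeit

import pandas as pd

# Hypothetical test frame: the four example rows repeated n times.
n = 5_000
big = pd.DataFrame({
    'i': [{1, 2, 3, 4} for _ in range(4 * n)],
    'j': [{2, 3}, {1}, {4}, {3, 4}] * n,
})

def with_zip(frame):
    # List comprehension over the two columns, as in the first snippet.
    return [a & b for a, b in zip(frame['i'], frame['j'])]

def with_apply(frame):
    # Row-wise apply, as in the last snippet.
    return frame.apply(lambda row: row['i'] & row['j'], axis=1)

t_zip = timeit.timeit(lambda: with_zip(big), number=3)
t_apply = timeit.timeit(lambda: with_apply(big), number=3)
print(f"zip: {t_zip:.3f}s  apply: {t_apply:.3f}s")
```

Both helpers produce the same sets, so the choice comes down to speed and readability.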

Upvotes: 6

FLab

Reputation: 7476

You can reproduce the set intersection using set differences. The intersection of A and B is equal to A minus the elements of A that are not in B. (Symmetrically, you can do the same starting from B.)

So you can use the Series sub method to compute the set differences:

df['k'] = df['i'].sub(df['i'].sub(df['j']))
# df['k'] = df['j'].sub(df['j'].sub(df['i'])) # equivalent

Which gives the expected output:

df
Out[11]: 
              i       j       k
0  {1, 2, 3, 4}  {2, 3}  {2, 3}
1  {1, 2, 3, 4}     {1}     {1}
2  {1, 2, 3, 4}     {4}     {4}
3  {1, 2, 3, 4}  {3, 4}  {3, 4}
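
The identity behind this approach can be checked with plain Python sets, independently of pandas. This is a minimal sketch using the rows from the question; the names rows_i, rows_j and k are only illustrative. Whether sub falls through to the set - operator on object columns may depend on the pandas version, so a check like this on a few rows is a cheap safeguard.

```python
# Rows from the question, as plain Python sets.
rows_i = [{1, 2, 3, 4} for _ in range(4)]
rows_j = [{2, 3}, {1}, {4}, {3, 4}]

# A & B computed as A - (A - B), row by row.
k = [a - (a - b) for a, b in zip(rows_i, rows_j)]

for a, b, inter in zip(rows_i, rows_j, k):
    assert inter == a & b        # matches the built-in intersection
    assert b - (b - a) == a & b  # symmetric form, starting from B

print(k)
```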

Upvotes: 3
