Squid Game

Reputation: 140

Finding intersection between two dataframes iteratively

I have the following two dataframes and would like to find their intersection.

df1 = pd.DataFrame({"0": [1524, 8788, 9899, 27172],
                   "1": [1333, 4476, 78783, 90832],
                   "2": [2021, 2022, 34522, 38479]})

print(df1)

      0      1      2
0   1524   1333   2021
1   8788   4476   2022
2   9899  78783  34522
3  27172  90832  38479

df2 has a single column '0' whose cells are lists, and looks like this:

          0
[1123, 2021, 1333, 6636], 
[1245, 2022, 4477, 0], 
[1524, 2023, 1, 27172], 
[2021, 2023, 90832, 38479]

The expected output is the row-wise intersection of df1 and df2, for example:

df3 = [2021, 1333],
      [2022],
      [0],
      [90832, 38479]

What I have read so far covers finding the intersection for a single list, not for two dataframes with different data types. My end goal is to compute precision, which is the size of the intersection of df1 and df2 divided by the total number of my recommendations from df1, which is 3.

Additional note from the comments below: the rows are aligned and should be compared pairwise. The [0] in df3 does not appear anywhere in the data, but it could serve as a placeholder when the intersection is empty.

Upvotes: 1

Views: 123

Answers (2)

user7864386

Given

df1:

       0      1      2
0   1524   1333   2021
1   8788   4476   2022
2   9899  78783  34522
3  27172  90832  38479

and df2:

                            0
0    [1123, 2021, 1333, 6636]
1       [1245, 2022, 4477, 0]
2      [1524, 2023, 1, 27172]
3  [2021, 2023, 90832, 38479]

You can use set.intersection inside a list comprehension, pairing each row of df1 with the list in the same row of df2:

df1_lst = df1.to_numpy().tolist()
df2_lst = df2.to_numpy().tolist()
df3 = pd.DataFrame([[list(set(i).intersection(j[0]))] for i,j in zip(df1_lst, df2_lst)], columns=['col'])

Output:

              col
0    [1333, 2021]
1          [2022]
2              []
3  [90832, 38479]
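
The question's end goal is precision: the number of common values in each row divided by the number of recommendations per row. A minimal sketch of that step, assuming df2 holds one list per row in column '0' as shown above and that "3" refers to the column count of df1; the names hits and precision are illustrative and not from the original post:

```python
import pandas as pd

df1 = pd.DataFrame({"0": [1524, 8788, 9899, 27172],
                    "1": [1333, 4476, 78783, 90832],
                    "2": [2021, 2022, 34522, 38479]})
df2 = pd.DataFrame({"0": [[1123, 2021, 1333, 6636],
                          [1245, 2022, 4477, 0],
                          [1524, 2023, 1, 27172],
                          [2021, 2023, 90832, 38479]]})

# Size of the row-wise intersection (rows are aligned and compared pairwise).
hits = [len(set(row) & set(lst))
        for row, lst in zip(df1.to_numpy().tolist(), df2["0"])]

# Precision per row: matches divided by the number of recommendations
# per row (assumed here to be df1's column count, i.e. 3).
precision = [h / df1.shape[1] for h in hits]

print(hits)       # [2, 1, 0, 2]
print(precision)
```

A row with no common values gets a precision of 0.0, so no placeholder such as [0] is needed for this calculation.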

Upvotes: 2

wwnde

Reputation: 26676

lst = [[1123, 2021, 1333, 6636],
       [1245, 2022, 4477, 0],
       [1524, 2023, 1, 27172],
       [2021, 2023, 90832, 38479]]

# Convert each inner list to a set
s = [set(x) for x in lst]

# Build one set per row of df1
s1 = df1.agg(set, axis=1).to_list()

# Pairwise intersection of the aligned rows
[list(x.intersection(y)) for x, y in zip(s, s1)]

Output:

[[1333, 2021], [2022], [], [90832, 38479]]
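
Sets do not guarantee element order, while the question's expected output lists [2021, 1333] (the order in df2) and uses [0] for an empty intersection. A sketch that keeps the order of the second list and applies that placeholder, using plain lists; the function name intersect_rows and its placeholder parameter are hypothetical, and df1_rows stands in for df1.to_numpy().tolist():

```python
def intersect_rows(rows_a, rows_b, placeholder=None):
    """Pairwise intersection of aligned rows, keeping the order of rows_b."""
    result = []
    for a, b in zip(rows_a, rows_b):
        lookup = set(a)                          # fast membership tests
        common = [x for x in b if x in lookup]   # preserves b's order
        if not common and placeholder is not None:
            common = [placeholder]               # e.g. [0] for "no match"
        result.append(common)
    return result

# Same values as df1 (row by row) and df2 in the question.
df1_rows = [[1524, 1333, 2021], [8788, 4476, 2022],
            [9899, 78783, 34522], [27172, 90832, 38479]]
lst = [[1123, 2021, 1333, 6636], [1245, 2022, 4477, 0],
       [1524, 2023, 1, 27172], [2021, 2023, 90832, 38479]]

print(intersect_rows(df1_rows, lst, placeholder=0))
# [[2021, 1333], [2022], [0], [90832, 38479]]
```

Note that a [0] placeholder is ambiguous if 0 can be a real value (it appears in the second row of df2), so leaving the list empty may be the safer choice when the result feeds a precision calculation.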

Upvotes: 1
