Jason Duncan
Jason Duncan

Reputation: 68

compare df1 column 1 to all columns in df2 returning the index of df2

I'm new to pandas so likely overlooking something but I've been searching and haven't found anything helpful yet.

What I'm trying to do is this. I have 2 dataframes. df1 has only 1 column and an unknown number of rows. df2 has an unknown number of rows also and also an unknown number of columns for each index.

Example:
df1:
0     1117454
1     1147637
2     1148945
3     1149662
4     1151543
5     1151545
6     1236268
7     1236671
8     1236673
...
300   1366962

df2:
                1        2        3        4        5        6        7  
8302813476  1375294  1375297  1375313  1375318  1375325  1375330  1375331   
8302813477  1317422  1363270  1363288  1363262     None     None     None   
8302813478  1187269  1187276  1149662  1147843  1147639  1236650  1236656

So what I want is to check all df1 values against df2 column 1 - n and if there is a match with any value in df1 mark the index of df2 as True else it is False.

Upvotes: 2

Views: 1297

Answers (2)

piRSquared
piRSquared

Reputation: 294536

As an interesting numpy alternative

l1 = df1.values.ravel()
l2 = df2.values.ravel()

pd.DataFrame(
    np.equal.outer(l1, l2).any(0).reshape(df2.values.shape),
    df2.index, df2.columns
)

or using set, list and comprehension

l1 = set(df1.values.ravel().tolist())
l2 = df2.values.ravel().tolist()

pd.DataFrame(
    np.array([bool(l1.intersection([d])) for d in l2]).reshape(df2.values.shape),
    df2.index, df2.columns
)

enter image description here

Upvotes: 3

jezrael
jezrael

Reputation: 863701

I think you can use isin for testing matching of Series created from df2 by stack with Series created from one column df1 by squeeze. Last reshape by unstack:

df3 = df2.stack().isin(df1.squeeze()).unstack()
print (df3)
                1      2      3      4      5      6      7
8302813476  False  False  False  False  False  False  False
8302813477  False  False  False  False  False  False  False
8302813478  False  False   True  False  False  False  False

Then get find all values where at least one True by any:

a = df3.any(axis=1)
print (a)
8302813476    False
8302813477    False
8302813478     True
dtype: bool

And last boolean indexing:

print (a[a].index)
Int64Index([8302813478], dtype='int64')

Another solution is instead squeeze use df1['col'].unique(), thank you Ted Petrou:

df3 = df2.stack().isin(df1['col'].unique()).unstack()
print (df3)
                1      2      3      4      5      6      7
8302813476  False  False  False  False  False  False  False
8302813477  False  False  False  False  False  False  False
8302813478  False  False   True  False  False  False  False

---

I like squeeze more, but same output is simple selecting column of df1:

df3 = df2.stack().isin(df1['col']).unstack()
print (df3)
                1      2      3      4      5      6      7
8302813476  False  False  False  False  False  False  False
8302813477  False  False  False  False  False  False  False
8302813478  False  False   True  False  False  False  False

Upvotes: 3

Related Questions