cyrus24
cyrus24

Reputation: 363

Search and lookup Search values from one dataframe in another dataframe and populate new column based on look up values in pandas

I have 2 dataframes - df1 and df2 which look like below. I need to search the values from df2['Pid'] in all columns of df1 (columns - a through f) and then create a new column df1['ind'], which will hold values from df2['ind'] wherever a match between values of df2['Pid'] is found in df1. To me it looks like a expanded look up case. I used df2.isin(df1['PERSON_UID']) to find to mark value found = true/ false in df1, but stuck at the creation of df1['ind'] column.

df1:

a       b       c           d       e       f
0       0       2106        0       0       0
0       2103    0           0       0       0
0       2104    0           0       0       0
0       2105    0           0       0       0
2100    0       0           0       0       0
2101    0       0           0       0       0
2102    0       0           0       0       0
0       0       2107        0       0       0
0       0       2108        0       0       0
0       0       2109        0       0       0
0       0       2110        0       0       0
0       0       2111        0       0       0
0       0       0           2112.   0       0
0       0       0           2113    0       0
0       0       0           2114    0       0
0       0       0           0       2115    0
0       0       0           0       2116    0
0       0       0           0       0       2117
0       0       0           0       0       2118
0       0       0           0       0       2119
0       0       0           0       2120    0

df2:

Pid       ind
2100        y
2101        n
2102        y
2103        n
2104        y
2105        n
2106        n
2107        n
2108        y
2109        y
2110        n
2111        y
2112        y
2113        y
2114        n
2115        n
2116        y
2117        y
2118        n
2119        y
2120        n

Desired op:

a       b       c           d       e       f       ind
0       0       2106        0       0       0       n
0       2103    0           0       0       0       n
0       2104    0           0       0       0       y
0       2105    0           0       0       0       n
2100    0       0           0       0       0       y
2101    0       0           0       0       0       n
2102    0       0           0       0       0       y
0       0       2107        0       0       0       n
0       0       2108        0       0       0       y
0       0       2109        0       0       0       y
0       0       2110        0       0       0       n
0       0       2111        0       0       0       y
0       0       0           2112.   0       0       y
0       0       0           2113    0       0       y
0       0       0           2114    0       0       n
0       0       0           0       2115    0       n
0       0       0           0       2116    0       y
0       0       0           0       0       2117    y
0       0       0           0       0       2118    n
0       0       0           0       0       2119    y
0       0       0           0       2120    0       n

Upvotes: 1

Views: 416

Answers (2)

r-beginners
r-beginners

Reputation: 35155

@jezrael's answer is perfect, if the Pid is not a duplicate, then you need the sum I was thinking of combining them as an index.

df['Pid'] = df.sum(axis=1)
df['Pid'] = df['Pid'].astype(int)
df = pd.merge(df, df2, on='Pid', how='inner')
df.drop('Pid', axis=1, inplace=True)

df
    a   b   c   d   e   f   ind
0   0   0   2106    0.0     0   0   n
1   0   2103    0   0.0     0   0   n
2   0   2104    0   0.0     0   0   y
3   0   2105    0   0.0     0   0   n
4   2100    0   0   0.0     0   0   y
5   2101    0   0   0.0     0   0   n
6   2102    0   0   0.0     0   0   y
7   0   0   2107    0.0     0   0   n
8   0   0   2108    0.0     0   0   y
9   0   0   2109    0.0     0   0   y
10  0   0   2110    0.0     0   0   n
11  0   0   2111    0.0     0   0   y
12  0   0   0   2112.0      0   0   y
13  0   0   0   2113.0      0   0   y
14  0   0   0   2114.0      0   0   n
15  0   0   0   0.0     2115    0   n
16  0   0   0   0.0     2116    0   y
17  0   0   0   0.0     0   2117    y
18  0   0   0   0.0     0   2118    n
19  0   0   0   0.0     0   2119    y
20  0   0   0   0.0     2120    0   n

Upvotes: 2

jezrael
jezrael

Reputation: 862761

Use:

df1['ind'] = df1.mask(df1.eq(0)).ffill(axis=1).iloc[:, -1].map(df2.set_index('Pid')['ind'])
print (df1)
       a     b     c       d     e     f ind
0      0     0  2106     0.0     0     0   n
1      0  2103     0     0.0     0     0   n
2      0  2104     0     0.0     0     0   y
3      0  2105     0     0.0     0     0   n
4   2100     0     0     0.0     0     0   y
5   2101     0     0     0.0     0     0   n
6   2102     0     0     0.0     0     0   y
7      0     0  2107     0.0     0     0   n
8      0     0  2108     0.0     0     0   y
9      0     0  2109     0.0     0     0   y
10     0     0  2110     0.0     0     0   n
11     0     0  2111     0.0     0     0   y
12     0     0     0  2112.0     0     0   y
13     0     0     0  2113.0     0     0   y
14     0     0     0  2114.0     0     0   n
15     0     0     0     0.0  2115     0   n
16     0     0     0     0.0  2116     0   y
17     0     0     0     0.0     0  2117   y
18     0     0     0     0.0     0  2118   n
19     0     0     0     0.0     0  2119   y
20     0     0     0     0.0  2120     0   n

Details:

First replace 0 values to missing values by DataFrame.mask:

print (df1.mask(df1.eq(0)))
         a       b       c       d       e       f
0      NaN     NaN  2106.0     NaN     NaN     NaN
1      NaN  2103.0     NaN     NaN     NaN     NaN
2      NaN  2104.0     NaN     NaN     NaN     NaN
3      NaN  2105.0     NaN     NaN     NaN     NaN
4   2100.0     NaN     NaN     NaN     NaN     NaN
5   2101.0     NaN     NaN     NaN     NaN     NaN
6   2102.0     NaN     NaN     NaN     NaN     NaN
7      NaN     NaN  2107.0     NaN     NaN     NaN
8      NaN     NaN  2108.0     NaN     NaN     NaN
9      NaN     NaN  2109.0     NaN     NaN     NaN
10     NaN     NaN  2110.0     NaN     NaN     NaN
11     NaN     NaN  2111.0     NaN     NaN     NaN
12     NaN     NaN     NaN  2112.0     NaN     NaN
13     NaN     NaN     NaN  2113.0     NaN     NaN
14     NaN     NaN     NaN  2114.0     NaN     NaN
15     NaN     NaN     NaN     NaN  2115.0     NaN
16     NaN     NaN     NaN     NaN  2116.0     NaN
17     NaN     NaN     NaN     NaN     NaN  2117.0
18     NaN     NaN     NaN     NaN     NaN  2118.0
19     NaN     NaN     NaN     NaN     NaN  2119.0
20     NaN     NaN     NaN     NaN  2120.0     NaN

Then forward filling missing values:

print (df1.mask(df1.eq(0)).ffill(axis=1))
         a       b       c       d       e       f
0      NaN     NaN  2106.0  2106.0  2106.0  2106.0
1      NaN  2103.0  2103.0  2103.0  2103.0  2103.0
2      NaN  2104.0  2104.0  2104.0  2104.0  2104.0
3      NaN  2105.0  2105.0  2105.0  2105.0  2105.0
4   2100.0  2100.0  2100.0  2100.0  2100.0  2100.0
5   2101.0  2101.0  2101.0  2101.0  2101.0  2101.0
6   2102.0  2102.0  2102.0  2102.0  2102.0  2102.0
7      NaN     NaN  2107.0  2107.0  2107.0  2107.0
8      NaN     NaN  2108.0  2108.0  2108.0  2108.0
9      NaN     NaN  2109.0  2109.0  2109.0  2109.0
10     NaN     NaN  2110.0  2110.0  2110.0  2110.0
11     NaN     NaN  2111.0  2111.0  2111.0  2111.0
12     NaN     NaN     NaN  2112.0  2112.0  2112.0
13     NaN     NaN     NaN  2113.0  2113.0  2113.0
14     NaN     NaN     NaN  2114.0  2114.0  2114.0
15     NaN     NaN     NaN     NaN  2115.0  2115.0
16     NaN     NaN     NaN     NaN  2116.0  2116.0
17     NaN     NaN     NaN     NaN     NaN  2117.0
18     NaN     NaN     NaN     NaN     NaN  2118.0
19     NaN     NaN     NaN     NaN     NaN  2119.0
20     NaN     NaN     NaN     NaN  2120.0  2120.0

Select last column by position with DataFrame.iloc:

print (df1.mask(df1.eq(0)).ffill(axis=1).iloc[:, -1])
0     2106.0
1     2103.0
2     2104.0
3     2105.0
4     2100.0
5     2101.0
6     2102.0
7     2107.0
8     2108.0
9     2109.0
10    2110.0
11    2111.0
12    2112.0
13    2113.0
14    2114.0
15    2115.0
16    2116.0
17    2117.0
18    2118.0
19    2119.0
20    2120.0
Name: f, dtype: float64

And last use Series.map.

Upvotes: 1

Related Questions