Reputation: 363
I have 2 dataframes - df1 and df2 which look like below. I need to search the values from df2['Pid'] in all columns of df1 (columns - a through f) and then create a new column df1['ind'], which will hold values from df2['ind'] wherever a match between values of df2['Pid'] is found in df1. To me it looks like a expanded look up case. I used df2.isin(df1['PERSON_UID'])
to find to mark value found = true/ false in df1, but stuck at the creation of df1['ind'] column.
df1:
a b c d e f
0 0 2106 0 0 0
0 2103 0 0 0 0
0 2104 0 0 0 0
0 2105 0 0 0 0
2100 0 0 0 0 0
2101 0 0 0 0 0
2102 0 0 0 0 0
0 0 2107 0 0 0
0 0 2108 0 0 0
0 0 2109 0 0 0
0 0 2110 0 0 0
0 0 2111 0 0 0
0 0 0 2112. 0 0
0 0 0 2113 0 0
0 0 0 2114 0 0
0 0 0 0 2115 0
0 0 0 0 2116 0
0 0 0 0 0 2117
0 0 0 0 0 2118
0 0 0 0 0 2119
0 0 0 0 2120 0
df2:
Pid ind
2100 y
2101 n
2102 y
2103 n
2104 y
2105 n
2106 n
2107 n
2108 y
2109 y
2110 n
2111 y
2112 y
2113 y
2114 n
2115 n
2116 y
2117 y
2118 n
2119 y
2120 n
Desired op:
a b c d e f ind
0 0 2106 0 0 0 n
0 2103 0 0 0 0 n
0 2104 0 0 0 0 y
0 2105 0 0 0 0 n
2100 0 0 0 0 0 y
2101 0 0 0 0 0 n
2102 0 0 0 0 0 y
0 0 2107 0 0 0 n
0 0 2108 0 0 0 y
0 0 2109 0 0 0 y
0 0 2110 0 0 0 n
0 0 2111 0 0 0 y
0 0 0 2112. 0 0 y
0 0 0 2113 0 0 y
0 0 0 2114 0 0 n
0 0 0 0 2115 0 n
0 0 0 0 2116 0 y
0 0 0 0 0 2117 y
0 0 0 0 0 2118 n
0 0 0 0 0 2119 y
0 0 0 0 2120 0 n
Upvotes: 1
Views: 416
Reputation: 35155
@jezrael's answer is perfect, if the Pid is not a duplicate, then you need the sum I was thinking of combining them as an index.
df['Pid'] = df.sum(axis=1)
df['Pid'] = df['Pid'].astype(int)
df = pd.merge(df, df2, on='Pid', how='inner')
df.drop('Pid', axis=1, inplace=True)
df
a b c d e f ind
0 0 0 2106 0.0 0 0 n
1 0 2103 0 0.0 0 0 n
2 0 2104 0 0.0 0 0 y
3 0 2105 0 0.0 0 0 n
4 2100 0 0 0.0 0 0 y
5 2101 0 0 0.0 0 0 n
6 2102 0 0 0.0 0 0 y
7 0 0 2107 0.0 0 0 n
8 0 0 2108 0.0 0 0 y
9 0 0 2109 0.0 0 0 y
10 0 0 2110 0.0 0 0 n
11 0 0 2111 0.0 0 0 y
12 0 0 0 2112.0 0 0 y
13 0 0 0 2113.0 0 0 y
14 0 0 0 2114.0 0 0 n
15 0 0 0 0.0 2115 0 n
16 0 0 0 0.0 2116 0 y
17 0 0 0 0.0 0 2117 y
18 0 0 0 0.0 0 2118 n
19 0 0 0 0.0 0 2119 y
20 0 0 0 0.0 2120 0 n
Upvotes: 2
Reputation: 862761
Use:
df1['ind'] = df1.mask(df1.eq(0)).ffill(axis=1).iloc[:, -1].map(df2.set_index('Pid')['ind'])
print (df1)
a b c d e f ind
0 0 0 2106 0.0 0 0 n
1 0 2103 0 0.0 0 0 n
2 0 2104 0 0.0 0 0 y
3 0 2105 0 0.0 0 0 n
4 2100 0 0 0.0 0 0 y
5 2101 0 0 0.0 0 0 n
6 2102 0 0 0.0 0 0 y
7 0 0 2107 0.0 0 0 n
8 0 0 2108 0.0 0 0 y
9 0 0 2109 0.0 0 0 y
10 0 0 2110 0.0 0 0 n
11 0 0 2111 0.0 0 0 y
12 0 0 0 2112.0 0 0 y
13 0 0 0 2113.0 0 0 y
14 0 0 0 2114.0 0 0 n
15 0 0 0 0.0 2115 0 n
16 0 0 0 0.0 2116 0 y
17 0 0 0 0.0 0 2117 y
18 0 0 0 0.0 0 2118 n
19 0 0 0 0.0 0 2119 y
20 0 0 0 0.0 2120 0 n
Details:
First replace 0
values to missing values by DataFrame.mask
:
print (df1.mask(df1.eq(0)))
a b c d e f
0 NaN NaN 2106.0 NaN NaN NaN
1 NaN 2103.0 NaN NaN NaN NaN
2 NaN 2104.0 NaN NaN NaN NaN
3 NaN 2105.0 NaN NaN NaN NaN
4 2100.0 NaN NaN NaN NaN NaN
5 2101.0 NaN NaN NaN NaN NaN
6 2102.0 NaN NaN NaN NaN NaN
7 NaN NaN 2107.0 NaN NaN NaN
8 NaN NaN 2108.0 NaN NaN NaN
9 NaN NaN 2109.0 NaN NaN NaN
10 NaN NaN 2110.0 NaN NaN NaN
11 NaN NaN 2111.0 NaN NaN NaN
12 NaN NaN NaN 2112.0 NaN NaN
13 NaN NaN NaN 2113.0 NaN NaN
14 NaN NaN NaN 2114.0 NaN NaN
15 NaN NaN NaN NaN 2115.0 NaN
16 NaN NaN NaN NaN 2116.0 NaN
17 NaN NaN NaN NaN NaN 2117.0
18 NaN NaN NaN NaN NaN 2118.0
19 NaN NaN NaN NaN NaN 2119.0
20 NaN NaN NaN NaN 2120.0 NaN
Then forward filling missing values:
print (df1.mask(df1.eq(0)).ffill(axis=1))
a b c d e f
0 NaN NaN 2106.0 2106.0 2106.0 2106.0
1 NaN 2103.0 2103.0 2103.0 2103.0 2103.0
2 NaN 2104.0 2104.0 2104.0 2104.0 2104.0
3 NaN 2105.0 2105.0 2105.0 2105.0 2105.0
4 2100.0 2100.0 2100.0 2100.0 2100.0 2100.0
5 2101.0 2101.0 2101.0 2101.0 2101.0 2101.0
6 2102.0 2102.0 2102.0 2102.0 2102.0 2102.0
7 NaN NaN 2107.0 2107.0 2107.0 2107.0
8 NaN NaN 2108.0 2108.0 2108.0 2108.0
9 NaN NaN 2109.0 2109.0 2109.0 2109.0
10 NaN NaN 2110.0 2110.0 2110.0 2110.0
11 NaN NaN 2111.0 2111.0 2111.0 2111.0
12 NaN NaN NaN 2112.0 2112.0 2112.0
13 NaN NaN NaN 2113.0 2113.0 2113.0
14 NaN NaN NaN 2114.0 2114.0 2114.0
15 NaN NaN NaN NaN 2115.0 2115.0
16 NaN NaN NaN NaN 2116.0 2116.0
17 NaN NaN NaN NaN NaN 2117.0
18 NaN NaN NaN NaN NaN 2118.0
19 NaN NaN NaN NaN NaN 2119.0
20 NaN NaN NaN NaN 2120.0 2120.0
Select last column by position with DataFrame.iloc
:
print (df1.mask(df1.eq(0)).ffill(axis=1).iloc[:, -1])
0 2106.0
1 2103.0
2 2104.0
3 2105.0
4 2100.0
5 2101.0
6 2102.0
7 2107.0
8 2108.0
9 2109.0
10 2110.0
11 2111.0
12 2112.0
13 2113.0
14 2114.0
15 2115.0
16 2116.0
17 2117.0
18 2118.0
19 2119.0
20 2120.0
Name: f, dtype: float64
And last use Series.map
.
Upvotes: 1