Reputation: 3263
I cannot figure out a nice panda-ish way to fill in missing NaN values for left join by sampling from right table.
e.g joined_left = left.merge(right, how="left", left_on=[attr1], right_on=[attr2]) from left and right
0 1 2
0 1 1 1
1 2 2 2
2 3 3 3
3 9 9 9
4 1 3 2
0 1 2
0 1 2 2
1 1 2 3
2 3 2 2
3 3 2 9
4 3 2 2
produces smth like
0 1_x 2_x 1_y 2_y
0 1 1 1 2.0 2.0
1 1 1 1 2.0 3.0
2 2 2 2 NaN NaN
3 3 3 3 2.0 2.0
4 3 3 3 2.0 9.0
5 3 3 3 2.0 2.0
6 9 9 9 NaN NaN
7 1 3 2 2.0 2.0
8 1 3 2 2.0 3.0
How do I sample a row from a right table instead of filling NaNs?
This is what I tried so far playground:
left = [[1,1,1], [2,2,2],[3,3,3], [9,9,9], [1,3,2]]
right = [[1,2,2],[1,2,3],[3,2,2], [3,2,9], [3,2,2]]
left = np.asarray(left)
right = np.asarray(right)
left = pd.DataFrame(left)
right = pd.DataFrame(right)
joined_left = left.merge(right, how="left", left_on=[0], right_on=[0])
while(joined_left.isnull().values.any()):
right_sample = right.sample().drop(0, axis=1)
joined_left.fillna(value=right_sample, limit=1)
print joined_left
Basically sample randomly and use fillna() for first occurance of NaN value to fill in...but for some reason I get no output.
Thank you!
One of outputs could be
0 1_x 2_x 1_y 2_y
0 1 1 1 2.0 2.0
1 1 1 1 2.0 3.0
2 2 2 2 2.0 2.0
3 3 3 3 2.0 2.0
4 3 3 3 2.0 9.0
5 3 3 3 2.0 2.0
6 9 9 9 3.0 2.9
7 1 3 2 2.0 2.0
8 1 3 2 2.0 3.0
with sampled 3 2 2
and3 2 9
Upvotes: 0
Views: 870
Reputation: 323326
Using sample
with fillna
joined_left = left.merge(right, how="left", left_on=[0], right_on=[0],indicator=True) # adding indicator
joined_left
Out[705]:
0 1_x 2_x 1_y 2_y _merge
0 1 1 1 2.0 2.0 both
1 1 1 1 2.0 3.0 both
2 2 2 2 NaN NaN left_only
3 3 3 3 2.0 2.0 both
4 3 3 3 2.0 9.0 both
5 3 3 3 2.0 2.0 both
6 9 9 9 NaN NaN left_only
7 1 3 2 2.0 2.0 both
8 1 3 2 2.0 3.0 both
nnull=joined_left['_merge'].eq('left_only').sum() # find all many row miss match , at the mergedf
s=right.sample(nnull)# rasmple from the dataframe after dropna
s.index=joined_left.index[joined_left['_merge'].eq('left_only')] # reset the index of the subset fill df to the index of null value show up
joined_left.fillna(s.rename(columns={1:'1_y',2:'2_y'}))
Out[706]:
0 1_x 2_x 1_y 2_y _merge
0 1 1 1 2.0 2.0 both
1 1 1 1 2.0 3.0 both
2 2 2 2 2.0 2.0 left_only
3 3 3 3 2.0 2.0 both
4 3 3 3 2.0 9.0 both
5 3 3 3 2.0 2.0 both
6 9 9 9 2.0 3.0 left_only
7 1 3 2 2.0 2.0 both
8 1 3 2 2.0 3.0 both
Upvotes: 1