Reputation: 71
I have a dataframe containing two columns; A_ID and R_ID.
I want to update R_ID to only contain values that are also in A_ID, the rest should be deleted (also NaN). The values should stay at the same position/index. I get that this is an inner join but with my proposed solution I got several problems.
Example:
import pandas as pd
data = {'A_ID': ['1E2', '1E3', '1E4', '1E5'], 'R_ID': ['1E7',[np.nan],[np.nan],"1E4",]}
df = pd.DataFrame(data)
print(df)
I tried
df_A_ID = df[["A_ID"]]
df_R_ID = df[["R_ID"]]
new_df = pd.merge(df_A_ID, df_R_ID, how='inner', left_on='A_ID', right_on ='R_ID', right_index=True)
and
new_df = pd.concat([dataset_A_ID, dataset_R_ID],join="inner")
But with the first option I get an "You are trying to merge on object and int64 columns" error, even though both columns are of d.types object and with the second one i get an empty DataFrame.
My expected output would be the same dataframe as before but with R_ID only containing values that are also in the column A_ID, at the same index/position:
data = {'A_ID': ['1E2', '1E3', '1E4', '1E5'], 'R_ID': [[np.nan],[np.nan],[np.nan],"1E4",]}
df = pd.DataFrame(data)
print(df)
Upvotes: 1
Views: 171
Reputation: 392
It should work
df_A_ID = df[["A_ID"]].astype(dtype=pd.StringDtype())
df_R_ID = df[["R_ID"]].astype(dtype=pd.StringDtype()).reset_index()
temp_df = pd.merge(df_A_ID, df_R_ID, how='inner', left_on='A_ID', right_on ='R_ID').set_index('index')
df.loc[~(df_R_ID.isin(temp_df[['R_ID']])['R_ID']).fillna(False),'R_ID'] = [np.nan]
Output
A_ID R_ID
0 1E2 NaN
1 1E3 NaN
2 1E4 1E4
3 1E5 NaN
Upvotes: 0
Reputation: 862691
Set NaN
by Series.where
if no match columns compared by Series.isin
:
#solution working with scalar NaNs
data = {'A_ID': ['1E2', '1E3', '1E4', '1E5'], 'R_ID': ['1E7',np.nan,np.nan,"1E4",]}
df = pd.DataFrame(data)
print(df)
A_ID R_ID
0 1E2 1E7
1 1E3 NaN
2 1E4 NaN
3 1E5 1E4
df['R_ID'] = df['R_ID'].where(df["R_ID"].isin(df["A_ID"]))
print(df)
A_ID R_ID
0 1E2 NaN
1 1E3 NaN
2 1E4 NaN
3 1E5 1E4
Or:
df.loc[~df["R_ID"].isin(df["A_ID"]), 'R_ID'] = np.nan
Upvotes: 1
Reputation: 120419
Use isin
:
df['R_ID'] = df['R_ID'].loc[df['R_ID'].isin(df['A_ID'])]
>>> df
A_ID R_ID
0 1E2 NaN
1 1E3 NaN
2 1E4 NaN
3 1E5 1E4
Upvotes: 0