DanielSchloß
DanielSchloß

Reputation: 71

How to merge (inner) two columns of a dataframe with pandas/python?

I have a dataframe containing two columns; A_ID and R_ID.

I want to update R_ID to only contain values that are also in A_ID, the rest should be deleted (also NaN). The values should stay at the same position/index. I get that this is an inner join but with my proposed solution I got several problems.

Example:

import pandas as pd
data = {'A_ID': ['1E2', '1E3', '1E4', '1E5'], 'R_ID': ['1E7',[np.nan],[np.nan],"1E4",]}
df = pd.DataFrame(data)
print(df)

I tried

df_A_ID = df[["A_ID"]]
df_R_ID = df[["R_ID"]]
new_df = pd.merge(df_A_ID, df_R_ID, how='inner', left_on='A_ID', right_on ='R_ID', right_index=True)

and

new_df = pd.concat([dataset_A_ID, dataset_R_ID],join="inner")

But with the first option I get an "You are trying to merge on object and int64 columns" error, even though both columns are of d.types object and with the second one i get an empty DataFrame.

My expected output would be the same dataframe as before but with R_ID only containing values that are also in the column A_ID, at the same index/position:

data = {'A_ID': ['1E2', '1E3', '1E4', '1E5'], 'R_ID': [[np.nan],[np.nan],[np.nan],"1E4",]}
df = pd.DataFrame(data)
print(df)

Upvotes: 1

Views: 171

Answers (3)

voldr
voldr

Reputation: 392

It should work

df_A_ID = df[["A_ID"]].astype(dtype=pd.StringDtype())
df_R_ID = df[["R_ID"]].astype(dtype=pd.StringDtype()).reset_index()
temp_df = pd.merge(df_A_ID, df_R_ID, how='inner', left_on='A_ID', right_on ='R_ID').set_index('index')

df.loc[~(df_R_ID.isin(temp_df[['R_ID']])['R_ID']).fillna(False),'R_ID'] = [np.nan]

Output

  A_ID R_ID
0  1E2  NaN
1  1E3  NaN
2  1E4  1E4
3  1E5  NaN

Upvotes: 0

jezrael
jezrael

Reputation: 862691

Set NaN by Series.where if no match columns compared by Series.isin:

#solution working with scalar NaNs
data = {'A_ID': ['1E2', '1E3', '1E4', '1E5'], 'R_ID': ['1E7',np.nan,np.nan,"1E4",]}
df = pd.DataFrame(data)
print(df)
  A_ID R_ID
0  1E2  1E7
1  1E3  NaN
2  1E4  NaN
3  1E5  1E4

df['R_ID'] = df['R_ID'].where(df["R_ID"].isin(df["A_ID"]))
print(df)
  A_ID R_ID
0  1E2  NaN
1  1E3  NaN
2  1E4  NaN
3  1E5  1E4

Or:

df.loc[~df["R_ID"].isin(df["A_ID"]), 'R_ID'] = np.nan

Upvotes: 1

Corralien
Corralien

Reputation: 120419

Use isin:

df['R_ID'] = df['R_ID'].loc[df['R_ID'].isin(df['A_ID'])]
>>> df
  A_ID R_ID
0  1E2  NaN
1  1E3  NaN
2  1E4  NaN
3  1E5  1E4

Upvotes: 0

Related Questions