Reputation: 930
Okay, so I have two DataFrames which share one common column ('ID'). I would like to add all keys and their values from df_2 to df_1, matching their respective 'ID'. I added a minimal reproducible example below. In this case the IDs happen to be in the same order; however, that is not necessarily the case. Both DataFrames always contain the same IDs. Important to note: the IDs are not unique, but the order of the occurrences of each ID is the same in both DataFrames.
import pandas as pd

x = {'ID':[1]*50+[2]*50,
'row':[9, 2, 8, 4, 2, 5, 2, 1, 1, 6, 5, 6, 3, 1, 1, 2, 1, 2, 3, 8, 3, 2,
8, 8, 7, 4, 2, 3, 7, 7, 7, 4, 3, 7, 5, 6, 8, 7, 4, 5, 7, 1, 8, 9,
4, 1, 6, 1, 7, 1, 5, 9, 9, 9, 4, 6, 5, 1, 7, 9, 5, 9, 6, 8, 6, 7,
4, 3, 4, 4, 7, 2, 7, 8, 6, 6, 8, 5, 4, 9, 9, 4, 3, 9, 7, 8, 3, 5,
1, 6, 1, 6, 8, 2, 1, 4, 3, 7, 7, 6],
'block':[2, 8, 8, 2, 4, 1, 6, 4, 2, 7, 7, 1, 3, 1, 4, 8, 4, 2, 5, 2, 4, 9,
3, 6, 6, 1, 4, 2, 7, 4, 4, 1, 9, 8, 5, 1, 1, 4, 6, 4, 9, 1, 7, 4,
5, 3, 6, 2, 6, 8, 5, 9, 5, 8, 1, 8, 3, 3, 5, 7, 3, 2, 5, 9, 6, 4,
3, 9, 7, 5, 2, 1, 2, 4, 6, 7, 7, 8, 2, 2, 7, 7, 9, 3, 8, 7, 3, 3,
1, 4, 6, 2, 6, 2, 8, 1, 1, 7, 5, 1]}
y = {'ID':[1]*50+[2]*50,
'Var1':[8, 3, 8, 3, 3, 2, 7, 8, 9, 6, 5, 6, 6, 3, 8, 3, 6, 4, 6, 2, 6, 2,
7, 3, 4, 7, 4, 9, 1, 5, 5, 5, 2, 9, 2, 5, 5, 1, 9, 9, 8, 2, 4, 2,
8, 7, 5, 8, 3, 9, 5, 9, 1, 1, 9, 9, 6, 3, 7, 1, 9, 8, 3, 3, 5, 7,
3, 8, 9, 1, 5, 2, 3, 3, 5, 5, 9, 6, 7, 4, 4, 1, 9, 7, 9, 8, 5, 4,
2, 5, 5, 8, 7, 5, 9, 9, 4, 5, 4, 8],
'Var2':[3, 3, 3, 8, 9, 5, 7, 5, 5, 3, 7, 5, 7, 7, 2, 1, 3, 9, 8, 5, 5, 1,
1, 8, 5, 3, 1, 5, 4, 3, 5, 2, 3, 2, 7, 3, 9, 4, 8, 4, 6, 6, 2, 3,
8, 3, 3, 6, 6, 4, 8, 4, 1, 9, 8, 9, 5, 7, 6, 9, 4, 6, 8, 5, 7, 2,
2, 8, 9, 1, 9, 4, 8, 8, 6, 8, 1, 1, 4, 7, 8, 4, 1, 2, 2, 9, 5, 3,
7, 4, 5, 4, 5, 1, 2, 3, 5, 9, 2, 4]}
df_1 = pd.DataFrame(x)
df_2 = pd.DataFrame(y)
My approach:
def add_analysis(data, ana):
    for key in ana.keys():
        if key == 'ID':
            continue
        add_db_key(ana, data, key)

def add_db_key(data_db, data_loaded, key):
    for ID in data_db.ID:
        data_loaded.loc[data_loaded['ID'] == ID, key] = data_db[data_db.ID == ID][key].values

add_analysis(df_1, df_2)
This works fine; however, for large DataFrames it takes forever, as I am iterating through every column and every row. I feel this is very inefficient and that there is probably a merge operation that would do the trick. Most importantly, though, the IDs have to match and the order has to be preserved. Any pointers to which function to look at are highly appreciated. Thanks in advance.
Upvotes: 0
Views: 58
Reputation: 161
Your functions do not work for me (they have no return value). However, if I understand correctly, you can solve the problem with something like this:
df_1['ID_idx'] = df_1.reset_index().apply(lambda x: f"{x['ID']}_{x['index']}", axis=1)
df_2['ID_idx'] = df_2.reset_index().apply(lambda x: f"{x['ID']}_{x['index']}", axis=1)
df = df_1.merge(df_2, on='ID_idx')
and then drop the columns you don't want.
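A note in case it helps: the same occurrence-based matching can be done without building string keys, by numbering each ID's occurrences with `groupby('ID').cumcount()` and merging on the (ID, occurrence) pair. A minimal sketch (the helper column name `occ` and the small stand-in frames are my own, not from the question):

```python
import pandas as pd

# Small stand-ins for df_1 and df_2: duplicate IDs, same order of occurrences.
df_1 = pd.DataFrame({'ID': [1, 2, 1, 2], 'row': [9, 2, 8, 4], 'block': [2, 8, 8, 2]})
df_2 = pd.DataFrame({'ID': [1, 2, 1, 2], 'Var1': [8, 3, 8, 3], 'Var2': [3, 3, 3, 8]})

# Number the occurrences of each ID (0, 1, 2, ...) in both frames.
df_1['occ'] = df_1.groupby('ID').cumcount()
df_2['occ'] = df_2.groupby('ID').cumcount()

# Merge on the (ID, occurrence) pair, then drop the helper column.
# With how='inner' (the default) and a unique (ID, occ) key, the
# left frame's row order is preserved.
df = df_1.merge(df_2, on=['ID', 'occ']).drop(columns='occ')
print(df)
```

This works even when the rows of the two frames are shuffled differently overall, as long as the occurrences of each individual ID appear in the same order, which matches the question's stated guarantee.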
Upvotes: 1