Reputation: 21
I have a problem with indexes in a dataframe. I have a data that looks kind of like this:
df1 = pd.DataFrame({'Name': ['ABC-123', 'DEF-456', 'GHI-789', 'JKL-321'],
'A1': [111, 222, 333, 444],
'A2': [555, 666, 777, 888],
'A3': [999, 123, 456, 789],
'A4': [101, 202, 303, 404],
'QQ': [1.3, 2.5, 3.7, 4.9]});
I was working with the data and at some point I have created a similar dataframe, with some of the data from the original one, but somewhere happened a mixup of indexes of the data, so the new data looks like this:
df2 = pd.DataFrame({'A1': [444, 222, 111, 333],
'A2': [888, 666, 555, 777],
'A3': [789, 123, 999, 456],
'A4': [404, 202, 101, 303]});
The data in rows is OK, just indexes are wrong, so I need the new dataframe as it is, but with the indexes of rows in the second dataframe matching the indexes in the first dataframe, that is for example the first row with numbers 444, 888, 789 and 404 should have index 3, not 0. How can I do that? I should probably compare each row of the second dataframe with the first dataframe, and if the row in a second dataframe matches the data in a row of the first dataframe, change the index of that row in second dataframe to the index of the matching row of the first dataframe, but I have no idea how to do that. Any suggestions?
Upvotes: 0
Views: 526
Reputation: 62403
pandas.concat
the two dataframes
concat
the two dataframes, but only on the columns in bothdf1
needs to be firstdf1
will bring rows not in df2
.duplicated
to create a mask
df1
and df2
will be True
, but the extra rows from df1
will be False
dfc
will now only have duplicate rows, and all the rows from df1
will be first.drop_duplicates
with keep='first'
will drop all the duplicates from the bottom of dfc
, thereby keeping only the rows from df1
with the correct indices.import pandas as pd
# concat dataframes
dfc = pd.concat([df1[df2.columns], df2])
# drop non-duplicate rows and then drop duplicates, but keep the 1st
df2_correct_indices = dfc[dfc.duplicated(keep=False)].drop_duplicates(keep='first')
# display(df2_correct_indices)
A1 A2 A3 A4
0 111 555 999 101
1 222 666 123 202
2 333 777 456 303
3 444 888 789 404
df1
df1 = pd.DataFrame({'Name': ['ABC-123', 'DEF-456', 'GHI-789', 'JKL-321', 'not_in_df2'],
'A1': [111, 222, 333, 444, 1000],
'A2': [555, 666, 777, 888, 1000],
'A3': [999, 123, 456, 789, 1000],
'A4': [101, 202, 303, 404, 1000],
'QQ': [1.3, 2.5, 3.7, 4.9, 1000]})
Name A1 A2 A3 A4 QQ
0 ABC-123 111 555 999 101 1.3
1 DEF-456 222 666 123 202 2.5
2 GHI-789 333 777 456 303 3.7
3 JKL-321 444 888 789 404 4.9
4 not_in_df2 1000 1000 1000 1000 1000.0
df2
df2 = pd.DataFrame({'A1': [444, 222, 111, 333],
'A2': [888, 666, 555, 777],
'A3': [789, 123, 999, 456],
'A4': [404, 202, 101, 303]})
A1 A2 A3 A4
0 444 888 789 404
1 222 666 123 202
2 111 555 999 101
3 333 777 456 303
Upvotes: 1