qpanas

Reputation: 21

How to fix mixed-up indexes

I have a problem with indexes in a dataframe. I have data that looks something like this:

df1 = pd.DataFrame({'Name': ['ABC-123', 'DEF-456', 'GHI-789', 'JKL-321'],
                    'A1': [111, 222, 333, 444],
                    'A2': [555, 666, 777, 888],
                    'A3': [999, 123, 456, 789],
                    'A4': [101, 202, 303, 404],
                    'QQ': [1.3, 2.5, 3.7, 4.9]})

While working with the data I created a similar dataframe, containing some of the data from the original one, but somewhere along the way the row indexes got mixed up, so the new data looks like this:

df2 = pd.DataFrame({'A1': [444, 222, 111, 333],
                    'A2': [888, 666, 555, 777],
                    'A3': [789, 123, 999, 456],
                    'A4': [404, 202, 101, 303]})

The data in the rows is fine; only the indexes are wrong. I need the new dataframe as it is, but with its row indexes matching the indexes in the first dataframe. For example, the first row with the numbers 444, 888, 789 and 404 should have index 3, not 0. How can I do that? I should probably compare each row of the second dataframe with the first dataframe and, where a row matches, change its index in the second dataframe to the index of the matching row in the first one, but I have no idea how to do that. Any suggestions?

Updates from comments:

  1. I need those indexes, but I need them to be right, so that the indexes in the second table follow the indexes in the first. Also, I cannot use any single column, as none of them is unique. Only the whole row is unique.
  2. I should probably add that those two tables are not real, but created as an example, and I see now that the ID column is misleading. I have changed it to AA, to avoid the confusion. So, as I said above, the columns are not unique; only the row as a whole is unique. The real tables have over 200 columns and thousands of rows.
  3. I cannot include the code, as it is large and complex and I have no idea where the mix-up happens. It is also irrelevant, because the data I am talking about has already been collected and saved to CSV files, so the two dataframes are in fact imports from those files. And that is my problem: I cannot rerun the original code, even if I fix it, because it would take two weeks. I have to fix the indexes in the second dataframe, which I suspect is not that complicated if one knows how. As I said, the data in the rows is fine; only the indexes are mixed up.
  4. No, it is just a name for the thing the rest of the data is about. By index I mean the dataframe index as created by default by pandas when the dataframe is built. The whole problem is that I created the first dataframe by importing data from a CSV file, did some very time-consuming work on this data and exported the result to another CSV file. When I later imported the second CSV file to do some analysis, I discovered that the indexes of rows with the same data don't match. So now I have to fix the indexes, or I will lose about two weeks of work.
  5. Yes, it is a subset of rows and columns. The data should be the same; just the indexes are mixed up. If I could compare each row of df2 with df1, to find out what the index of that row of data is in df1, and then change that row's index in df2, that would fix the problem. I just need the row indexes in df2 to match the indexes of the corresponding rows in df1. It is the df2 indexes that are mixed up; the indexes in df1 are as they should be in both dataframes.
  6. No, df2 has only some of the columns of df1 and some of the rows of df1, but the data will match between them. The order is different, which means the indexes are different if the data is the same. But I guess it is also possible for the rows to be sorted so that the indexes are not in order.

Upvotes: 0

Views: 526

Answers (1)

Trenton McKinney

Reputation: 62403

  • pandas.concat the two dataframes, but only on the columns they have in common
    • the order of concatenation matters: df1 needs to be first
    • df1 may bring along rows that are not in df2
  • Use .duplicated(keep=False) to create a mask
    • all the rows that appear in both df1 and df2 will be True; the extra rows from df1 will be False
    • applying the mask removes the extra rows, so dfc will only contain duplicated rows, with all the rows from df1 first
  • Finally, .drop_duplicates(keep='first') drops the duplicates from the bottom of dfc (the copies that came from df2), thereby keeping only the rows from df1 with the correct indices.
import pandas as pd

# concat dataframes
dfc = pd.concat([df1[df2.columns], df2])

# drop non-duplicate rows, then drop duplicates, keeping the first occurrence
df2_correct_indices = dfc[dfc.duplicated(keep=False)].drop_duplicates(keep='first')

# display(df2_correct_indices)

    A1   A2   A3   A4
0  111  555  999  101
1  222  666  123  202
2  333  777  456  303
3  444  888  789  404

Sample Data Used

df1

df1 = pd.DataFrame({'Name': ['ABC-123', 'DEF-456', 'GHI-789', 'JKL-321', 'not_in_df2'],
                    'A1': [111, 222, 333, 444, 1000],
                    'A2': [555, 666, 777, 888, 1000],
                    'A3': [999, 123, 456, 789, 1000],
                    'A4': [101, 202, 303, 404, 1000],
                    'QQ': [1.3, 2.5, 3.7, 4.9, 1000]})

         Name    A1    A2    A3    A4      QQ
0     ABC-123   111   555   999   101     1.3
1     DEF-456   222   666   123   202     2.5
2     GHI-789   333   777   456   303     3.7
3     JKL-321   444   888   789   404     4.9
4  not_in_df2  1000  1000  1000  1000  1000.0

df2

df2 = pd.DataFrame({'A1': [444, 222, 111, 333],
                    'A2': [888, 666, 555, 777],
                    'A3': [789, 123, 999, 456],
                    'A4': [404, 202, 101, 303]})

    A1   A2   A3   A4
0  444  888  789  404
1  222  666  123  202
2  111  555  999  101
3  333  777  456  303
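Note that the concat approach above returns the rows in df1's order. If you want to keep df2's original row order and only reassign its indices, a merge-based sketch is one alternative (the names `cols`, `mapping` and `df2_fixed` are chosen for this example): reset df1's index into a regular column, match each df2 row against df1 on all the shared columns (since only the whole row is unique), and set the matched df1 indices on df2.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['ABC-123', 'DEF-456', 'GHI-789', 'JKL-321', 'not_in_df2'],
                    'A1': [111, 222, 333, 444, 1000],
                    'A2': [555, 666, 777, 888, 1000],
                    'A3': [999, 123, 456, 789, 1000],
                    'A4': [101, 202, 303, 404, 1000],
                    'QQ': [1.3, 2.5, 3.7, 4.9, 1000]})

df2 = pd.DataFrame({'A1': [444, 222, 111, 333],
                    'A2': [888, 666, 555, 777],
                    'A3': [789, 123, 999, 456],
                    'A4': [404, 202, 101, 303]})

cols = list(df2.columns)

# carry df1's index along as a regular column, then match each df2 row
# against df1 on all shared columns (the whole row is the unique key)
mapping = df2.merge(df1[cols].reset_index(), on=cols, how='left')

# reassign df2's indices from the matched df1 indices, keeping df2's row order
df2_fixed = df2.set_index(mapping['index'].to_numpy())

print(df2_fixed)
```

This keeps df2's rows in place (the first row stays first, but now carries index 3 instead of 0); extra rows in df1, such as `not_in_df2`, are simply never matched. This assumes every df2 row has exactly one match in df1; duplicated whole rows in df1 would produce extra merged rows.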

Upvotes: 1
