orange
orange

Reputation: 8090

Get rows that are present in one dataframe, but not the other

I'd like to extract those rows from df1 which are not existent in df2 (identity is the index). For the below example, I would expect the first row in df1 to be returned. Unfortunately, the result is empty.

import pandas as pd

df1 = pd.DataFrame({
    'level-0': ['a', 'a', 'a', 'a', 'a', 'a'],
    'level-1': ['s2', 's2', 's2', 's2', 's2', 's2'],
    'level-2': ['1', '1', '1', '1', '1', '1'],
    'level-3': ['19', '20', '21', '22', '23', '24'],
    'level-4': ['HRB', 'HRB', 'HRB', 'HRB', 'HRB', 'HRB'],
    'name': ['a', 'b', 'c', 'd', 'e', 'f']
})

df1 = df1.set_index(['level-0', 'level-1', 'level-2', 'level-3', 'level-4'], drop=False)

df2 = pd.DataFrame({
    'level-0': ['a', 'a', 'a', 'a', 'a', 'b'],
    'level-1': ['s2', 's2', 's2', 's2', 's2', 's2'],
    'level-2': ['1', '1', '1', '1', '1', '1'],
    'level-3': ['19', '20', '21', '22', '23', '24'],
    'level-4': ['HRB', 'HRB', 'HRB', 'HRB', 'HRB', 'HRB']
})
df2 = df2.set_index(['level-0', 'level-1', 'level-2', 'level-3', 'level-4'], drop=False)

# all indices that are in df1 but not in df2
df_unknown = df1[~df1.index.isin(df2.index)]
print df_unknown

What's wrong with the selection?

Update

I figured out what went wrong. The dataframes were read from an Excel file and some Series were interpreted as int, while the dataframe to compare with had its columns already converted to str. This resulted in different indices.

Upvotes: 0

Views: 358

Answers (1)

filmor
filmor

Reputation: 32212

set_index is not in place by default, so df1 and df2 still have their numeric index after the call. Do either

df2.set_index(..., inplace=True)

or

df2 = df2.set_index(...)

You will see that by far the most methods in pandas work that way.

Upvotes: 1

Related Questions