toto_tico

Reputation: 19037

How to compare two dataframes ignoring column names?

Suppose I want to compare the content of two dataframes, but not the column names (or index names). Is it possible to achieve this without renaming the columns?

For example:

df = pd.DataFrame({'A': [1,2], 'B':[3,4]})
df_equal = pd.DataFrame({'a': [1,2], 'b':[3,4]})
df_diff = pd.DataFrame({'A': [1,2], 'B':[3,5]})

In this case, df should compare equal to df_equal but not to df_diff: the values in df_equal are the same as those in df, while the values in df_diff are not. Notice that the column names in df_equal differ from those in df, but I still want to get True.

I have tried the following:

equals:

# Returns false because of the column names
df.equals(df_equal)

eq:

# Doesn't work: it aligns on labels, so it compares four columns (A, B, a, b), assuming nulls for the ones that don't exist
df.eq(df_equal).all().all()

pandas.testing.assert_frame_equal:

# Same problem as equals: raises AssertionError because of the column names
pd.testing.assert_frame_equal(df, df_equal, check_names=False)

I thought it would be possible to use assert_frame_equal, but none of its parameters seems to ignore column names.

Upvotes: 3

Views: 8145

Answers (3)

toto_tico

Reputation: 19037

I just needed to get the values (numpy array) from the data frame, so the column names won't be considered.

df.eq(df_equal.values).all().all()

I would still like to see a parameter on equals, or assert_frame_equal. Maybe I am missing something.


An advantage of this over @jpp's answer is that I can see which columns do not match by calling all() only once:

df.eq(df_diff.values).all()
Out[24]: 
A     True
B    False
dtype: bool

One problem is that eq treats np.nan as not equal to np.nan. In that case, the following expression works instead:

(df.eq(df_equal.values) | (df.isnull().values & df_equal.isnull().values)).all().all()
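
The NaN-aware expression above can be wrapped in a small helper. This is only a minimal sketch: the name equal_ignoring_labels is hypothetical (it is not part of pandas), and the shape check is an added assumption so that frames of different sizes simply compare as unequal.

```python
import numpy as np
import pandas as pd


def equal_ignoring_labels(left, right):
    """Compare two frames by position, treating NaN == NaN as equal.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Frames of different shapes cannot match cell by cell.
    if left.shape != right.shape:
        return False
    # Passing a NumPy array to eq() skips label alignment.
    same_value = left.eq(right.values)
    # Cells that are null in both frames count as equal.
    both_null = left.isnull().values & right.isnull().values
    return bool((same_value | both_null).all().all())


df = pd.DataFrame({'A': [1.0, np.nan], 'B': [3, 4]})
df_equal = pd.DataFrame({'a': [1.0, np.nan], 'b': [3, 4]})
df_diff = pd.DataFrame({'A': [1.0, np.nan], 'B': [3, 5]})

print(equal_ignoring_labels(df, df_equal))  # True
print(equal_ignoring_labels(df, df_diff))   # False
```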

Upvotes: 2

Fredcpp

Reputation: 1

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

for i in range(df1.shape[0]):
    for j in range(df1.shape[1]):
        print(df1.iloc[i, j] == df2.iloc[i, j])

This prints:

True
True
True
True

Same thing for:

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

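One obvious issue is that column names can still matter, because pandas may use them to order the columns. In older versions of pandas, columns built from a dict are sorted by name, so a positional comparison can end up pairing different columns. For example: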

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 2], 'B': [3, 4]})
print(df1)
print(df2)

renders as ('B' is before 'a' in df2):

   a  b
0  1  3
1  2  4
   B  a
0  3  1
1  4  2
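
The cell-by-cell loop at the top of this answer can also be written without Python loops. The following is a rough sketch, assuming both frames have the same shape; it compares by position only and lists the (row, column) positions that differ. The frames here are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 5]})

# Compare cell by cell using positions only; labels are never consulted.
matches = df1.to_numpy() == df2.to_numpy()

# (row, column) positions of the cells that differ.
mismatches = [tuple(int(i) for i in pos) for pos in np.argwhere(~matches)]
print(mismatches)  # [(1, 1)]
```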

Upvotes: 0

jpp

Reputation: 164713

pd.DataFrame is built around pd.Series, so it's unlikely you will be able to perform comparisons without column names.

But the most efficient way would be to drop down to numpy:

assert_equal = (df.values == df_equal.values).all()

To deal with np.nan, you can use np.testing.assert_equal and catch the AssertionError, as suggested by @Avaris:

import numpy as np

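def nan_equal(a, b):
    # np.testing.assert_equal treats NaNs in matching positions as equal
    # and raises AssertionError on any mismatch.
    try:
        np.testing.assert_equal(a, b)
    except AssertionError:
        return False
    return True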

assert_equal = nan_equal(df.values, df_equal.values)
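
As a possible alternative to catching the exception, np.array_equal accepts an equal_nan argument. The sketch below assumes NumPy 1.19 or later (where that argument was added) and purely numeric frames; the example frames are made up to include a NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan], 'B': [3.0, 4.0]})
df_equal = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
df_diff = pd.DataFrame({'A': [1.0, np.nan], 'B': [3.0, 5.0]})

# equal_nan=True treats NaNs in matching positions as equal
# (assumes NumPy >= 1.19 and numeric data only).
same = np.array_equal(df.to_numpy(), df_equal.to_numpy(), equal_nan=True)
different = np.array_equal(df.to_numpy(), df_diff.to_numpy(), equal_nan=True)

print(same)       # True
print(different)  # False
```

Arrays of different shapes also compare as unequal here instead of raising, which avoids a separate shape check.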

Upvotes: 4
