Reputation: 5769
I've got a script updating 5-10 columns worth of data , but sometimes the start csv will be identical to the end csv so instead of writing an identical csvfile I want it to do nothing...
How can I compare two dataframes to check if they're the same or not?
csvdata = pandas.read_csv('csvfile.csv')
csvdata_old = csvdata
# ... do stuff with csvdata dataframe
if csvdata_old != csvdata:
csvdata.to_csv('csvfile.csv', index=False)
Any ideas?
Upvotes: 90
Views: 174663
Reputation: 1381
Some of the answers provided fail to detect differences in datasets because they don't consider index values. Let's build two simple datasets and analyze some of the proposed approaches:
import pandas as pd
df_1 = pd.DataFrame({'a': [1,1], 'b': [2,1]})
df_1
a | b | |
---|---|---|
0 | 1 | 2 |
1 | 1 | 1 |
df_2 = pd.DataFrame({'a': [1,1], 'b': [1,2]})
df_2
a | b | |
---|---|---|
0 | 1 | 1 |
1 | 1 | 2 |
These two dataframes are different because the first row in the df_1
is clearly not the same as the first row in the df_2
. If we apply the method proposed by Tom Chapin, it will not detect the difference because it only compares the combination of values for each column in each row, regardless of the row position (index). So in this example, since the combination (1,2) for columns a and b is already present in df_1
, it's considered a duplicate when merging with df_2
, even if it shows up at index 1 instead of 0. For the same reason the method proposed by leersej fails as well (you can try to apply these methods and see the results by yourself). The .equals
method proposed by Surya is able to detect the difference, returning False
in the comparison, but as pointed out by Dennis Golomazov, if you're interested in also checking the equality of indexes, you should add another check, as showed in his answer. In the end if you want to ensure that dataframes have the same column values at the same indexes, I think that the best method is to perform a direct match, as suggested by alpha_989, however the usage of the all()
method to check the final equality between the two fails, because it iterates over column names and not the boolean outcomes resulting from the comparison. all(df_1 == df_2)
in fact, returns wrongly True
. At the end of the day, to safely check if two dataframes are equal (using only the pandas
library), you should:
# 1 compare the two datasets
>>> comparison = (df_1 == df_2)
# 2 - see what columns are matching and what aren't
>>> comparison.all()
a True
b False
dtype: bool
# 3 - compare if all columns are matching
>>> comparison.all().all() # equivalent all(comparison.all())
False
# BONUS - row indexes that aren't matching (more info here: https://stackoverflow.com/a/52173171/11764049)
>>> comparison.T.all()[~comparison.T.all()].index
Int64Index([0, 1], dtype='int64')
In order to check the equality of two dataframes, considering also row indexes you can use the .equals
method as here, or:
comparison = (df_1 == df_2)
equality = comparison.all().all()
not_matching_row_idxs = comparison.T.all()[~comparison.T.all()].index
This approach also allows you to find the row indexes that are not matching.
Note: This method only works with dataframes having same index values.
Upvotes: 1
Reputation: 37
I hope this below code snippet helps you!
import pandas as pd
import datacompy
df_old_original = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [7, 7, 7, 7], [3, 3, 3, 3], [4, 4, 4, 4], [7, 7, 7, 7], [5, 5, 5, 5], [6, 6, 6, 6]], columns=['A', 'B', 'C', 'D'], index=[0, 1, 2, 3, 4, 5, 6, 7], dtype=object)
df_new_original = pd.DataFrame([[None, None, None, None], [1, 1, 1, 1], [2, 2, 2, 2], [8, 8, 8, 8], [3, 3, 3, 3], [4, 4, 4, 4], [7, 7, 7, 7], [5, 5, 5, 5], [None, None, None, None]], columns=['A', 'B', 'C', 'D'], index=[0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=object)
compare = datacompy.Compare(df_old_original, df_new_original, join_columns=['A', 'B', 'C', 'D'], abs_tol=0, rel_tol=0, df1_name='Old', df2_name='New')
changes_in_old_df = compare.df1_unq_rows
changes_in_new_df = compare.df2_unq_rows
print(changes_in_old_df)
print(changes_in_new_df)
print(Compare.report())
Upvotes: 1
Reputation: 14958
To pull out the symmetric differences:
df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
For example:
df1 = pd.DataFrame({
'num': [1, 4, 3],
'name': ['a', 'b', 'c'],
})
df2 = pd.DataFrame({
'num': [1, 2, 3],
'name': ['a', 'b', 'd'],
})
Will yield:
Note: until the next release of pandas, to avoid the warning about how the sort argument will be set in the future, just add the sort=False
argument. As below:
df_diff = pd.concat([df1,df2], sort=False).drop_duplicates(keep=False)
Upvotes: 8
Reputation: 5350
In my case, I had a weird error, whereby even though the indices, column-names
and values were same, the DataFrames
didnt match. I tracked it down to the
data-types, and it seems pandas
can sometimes use different datatypes,
resulting in such problems
For example:
param2 = pd.DataFrame({'a': [1]})
param1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [2], 'step': ['alpha']})
if you check param1.dtypes
and param2.dtypes
, you will find that 'a' is of
type object
for param1
and is of type int64
for param2
. Now, if you do
some manipulation using a combination of param1
and param2
, other
parameters of the dataframe will deviate from the default ones.
So after the final dataframe is generated, even though the actual values that
are printed out are same, final_df1.equals(final_df2)
, may turn out to be
not-equal, because those samll parameters like Axis 1
, ObjectBlock
,
IntBlock
maynot be the same.
A easy way to get around this and compare the values is to use
final_df1==final_df2
.
However, this will do a element by element comparison, so it wont work if you
are using it to assert a statement for example in pytest
.
What works well is
all(final_df1 == final_df2)
.
This does a element by element comparison, while neglecting the parameters not important for comparison.
If your values and indices are same, but final_df1.equals(final_df2)
is showing False
, you can use final_df1._data
and final_df2._data
to check the rest of the elements of the dataframes.
Upvotes: 8
Reputation: 3514
This compares the values of two dataframes note the number of row/columns needs to be the same between tables
comparison_array = table.values == expected_table.values
print (comparison_array)
>>>[[True, True, True]
[True, False, True]]
if False in comparison_array:
print ("Not the same")
#Return the position of the False values
np.where(comparison_array==False)
>>>(array([1]), array([1]))
You could then use this index information to return the value that does not match between tables. Since it's zero indexed, it's referring to the 2nd array in the 2nd position which is correct.
Upvotes: 7
Reputation: 3556
Not sure if this is helpful or not, but I whipped together this quick python method for returning just the differences between two dataframes that both have the same columns and shape.
def get_different_rows(source_df, new_df):
"""Returns just the rows from the new dataframe that differ from the source dataframe"""
merged_df = source_df.merge(new_df, indicator=True, how='outer')
changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
return changed_rows_df.drop('_merge', axis=1)
Upvotes: 49
Reputation: 17339
A more accurate comparison should check for index names separately, because DataFrame.equals
does not test for that. All the other properties (index values (single/multiindex), values, columns, dtypes) are checked by it correctly.
df1 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'name'])
df1 = df1.set_index('name')
df2 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'another_name'])
df2 = df2.set_index('another_name')
df1.equals(df2)
True
df1.index.names == df2.index.names
False
Note: using index.names
instead of index.name
makes it work for multi-indexed dataframes as well.
Upvotes: 11
Reputation: 11568
Check using: df_1.equals(df_2) # Returns True or False, details herebelow
In [45]: import numpy as np
In [46]: import pandas as pd
In [47]: np.random.seed(5)
In [48]: df_1= pd.DataFrame(np.random.randn(3,3))
In [49]: df_1
Out[49]:
0 1 2
0 0.441227 -0.330870 2.430771
1 -0.252092 0.109610 1.582481
2 -0.909232 -0.591637 0.187603
In [50]: np.random.seed(5)
In [51]: df_2= pd.DataFrame(np.random.randn(3,3))
In [52]: df_2
Out[52]:
0 1 2
0 0.441227 -0.330870 2.430771
1 -0.252092 0.109610 1.582481
2 -0.909232 -0.591637 0.187603
In [53]: df_1.equals(df_2)
Out[53]: True
In [54]: df_3= pd.DataFrame(np.random.randn(3,3))
In [55]: df_3
Out[55]:
0 1 2
0 -0.329870 -1.192765 -0.204877
1 -0.358829 0.603472 -1.664789
2 -0.700179 1.151391 1.857331
In [56]: df_1.equals(df_3)
Out[56]: False
Upvotes: 18
Reputation: 406
Not sure if this existed at the time the question was posted, but pandas now has a built-in function to test equality between two dataframes: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html.
Upvotes: 24
Reputation: 375475
You also need to be careful to create a copy of the DataFrame, otherwise the csvdata_old will be updated with csvdata (since it points to the same object):
csvdata_old = csvdata.copy()
To check whether they are equal, you can use assert_frame_equal as in this answer:
from pandas.util.testing import assert_frame_equal
assert_frame_equal(csvdata, csvdata_old)
You can wrap this in a function with something like:
try:
assert_frame_equal(csvdata, csvdata_old)
return True
except: # appeantly AssertionError doesn't catch all
return False
There was discussion of a better way...
Upvotes: 90