Reputation: 73
I want to compare two dataframes (both have the same number of rows and columns) in Python and get the number of differences between them. What would be the best way to do this?
def numberOfDifferencess(df1, df2):
    if df1.equals(df2):
        numberOfDifferences = 0
    else:
        ?????
Upvotes: 4
Views: 1681
Reputation: 7038
Here's one way:
df
a b
0 1 999
1 2 3
2 3 345
3 56 8
4 7 54
df_b
a b
0 1 111
1 2 3
2 3 345
3 56 8
4 7 54
Comparing:
df.count().sum() - (df == df_b).astype(int).sum().sum()
1 #this is the number of differences
In a function:
def numberOfDifferencess(df1, df2):
    return df1.count().sum() - (df1 == df2).astype(int).sum().sum()
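Essentially, (df == df_b).astype(int).sum().sum() counts the overlap between the two dataframes, i.e. the cells where a field in one equals the corresponding field in the other. Subtracting that from df.count().sum(), the number of non-NaN cells in df, leaves the number of differences.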
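To make the steps concrete, here is a minimal sketch that breaks the one-liner apart on the sample frames shown above. The intermediate variable names are only illustrative, and the shorter `!=` variant at the end assumes neither frame contains NaN.

```python
import pandas as pd

# Same sample data as the df and df_b frames shown above.
df = pd.DataFrame({"a": [1, 2, 3, 56, 7], "b": [999, 3, 345, 8, 54]})
df_b = pd.DataFrame({"a": [1, 2, 3, 56, 7], "b": [111, 3, 345, 8, 54]})

total_cells = df.count().sum()            # non-NaN cells in df (10 here)
equal_mask = df == df_b                   # boolean frame, True where cells match
matching_cells = equal_mask.sum().sum()   # True counts as 1 (9 here)
n_diff = total_cells - matching_cells     # cells that differ

# Shorter equivalent, assuming no NaN values in either frame.
n_diff_direct = (df != df_b).sum().sum()

print(n_diff, n_diff_direct)
```

Summing the boolean frame directly works because True is treated as 1, so the astype(int) step is optional.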
Quick Speed Test
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 100, size=(1000, 1000)))
df2 = pd.DataFrame(np.random.randint(0, 100, size=(1000, 1000)))
%timeit numberOfDifferencess(df1, df2)
%timeit number_of_diff(df1, df2) # using spies006's function for comparison (see below)
10 loops, best of 3: 20.6 ms per loop
1 loop, best of 3: 428 ms per loop
Not surprisingly, the vectorized approach is much faster here (20.6 ms versus 428 ms per loop). Iterating over a dataframe row by row is generally not the most efficient approach.
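Outside IPython the %timeit magic is not available. A rough sketch of the same comparison with the standard timeit module could look like this. The smaller 200x200 size, the fixed seed, and the repeat count are arbitrary choices made so that it runs quickly. Keep in mind that the two functions do not return the same number: one counts differing cells, the other differing rows.

```python
import timeit

import numpy as np
import pandas as pd

def numberOfDifferencess(df1, df2):
    # Vectorized: total non-NaN cells minus matching cells.
    return df1.count().sum() - (df1 == df2).astype(int).sum().sum()

def number_of_diff(df1, df2):
    # Row-by-row loop: counts rows that differ.
    differences = 0
    for i in range(len(df1)):
        if not df1.loc[i, :].equals(df2.loc[i, :]):
            differences += 1
    return differences

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.integers(0, 100, size=(200, 200)))
df2 = pd.DataFrame(rng.integers(0, 100, size=(200, 200)))

# Total seconds for 5 calls of each function.
t_vectorized = timeit.timeit(lambda: numberOfDifferencess(df1, df2), number=5)
t_loop = timeit.timeit(lambda: number_of_diff(df1, df2), number=5)

print(f"vectorized: {t_vectorized:.4f} s, loop: {t_loop:.4f} s")
```

Absolute timings depend on the machine and the pandas version, so treat the figures as relative only.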
Upvotes: 3
Reputation: 436
You could use the underlying ndarray functionality for this:
from pandas import DataFrame
df = DataFrame(data=[
[1, 2, 3, 4],
[6, 7, 8, 4],
[1, 2, 3, 2]])
dfd = DataFrame(data=[
[1, 2, 1, 4],
[6, 9, 8, 4],
[1, 1, 3, 2]])
diff = df.values != dfd.values
result = diff.flatten().sum()
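If you want this as a reusable function, one possible sketch follows. The name count_differences and the shape check are my own additions, not part of any pandas API. The comparison is positional, so index and column labels are ignored, and a NaN cell always counts as a difference because NaN != NaN.

```python
import numpy as np
from pandas import DataFrame

def count_differences(df1, df2):
    """Hypothetical helper: count differing cells in two same-shaped frames."""
    a = df1.to_numpy()
    b = df2.to_numpy()
    if a.shape != b.shape:
        raise ValueError("DataFrames must have the same shape")
    # Element-wise comparison on the raw arrays; True marks a differing cell.
    return int(np.count_nonzero(a != b))

df = DataFrame(data=[
    [1, 2, 3, 4],
    [6, 7, 8, 4],
    [1, 2, 3, 2]])
dfd = DataFrame(data=[
    [1, 2, 1, 4],
    [6, 9, 8, 4],
    [1, 1, 3, 2]])

result = count_differences(df, dfd)
print(result)
```

For the two frames above, the differing cells are at positions (0, 2), (1, 1) and (2, 1).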
Upvotes: 0
Reputation: 2927
>>> df1
a b
0 1 1
1 2 2
2 3 4
>>> df2
a b
0 1 1
1 2 2
2 8 4
Here is one way to do it; I've just built on what you already have. I use loc to iterate over each of the rows in df1 and df2. Note that this counts the rows that differ, not the individual cells.
>>> numberOfDifferences = 0
>>> for i in range(len(df1)):
...     if not df1.loc[i, :].equals(df2.loc[i, :]):
...         numberOfDifferences += 1
...
>>> numberOfDifferences
1
If you'd like it as a function, as your question implies, here it is:
def number_of_diff(df1, df2):
    differences = 0
    for i in range(len(df1)):
        if not df1.loc[i, :].equals(df2.loc[i, :]):
            differences += 1
    return differences
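The same row-level count can probably be had without an explicit loop. The sketch below is one way to do it, assuming both frames share the same index and column labels. The function name number_of_diff_rows is hypothetical. Unlike Series.equals, the != comparison treats NaN as different from NaN, so results can diverge from the loop when missing values are present.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 4]})
df2 = pd.DataFrame({"a": [1, 2, 8], "b": [1, 2, 4]})

def number_of_diff_rows(df1, df2):
    # Mark each row that has at least one differing cell, then count the marks.
    return int((df1 != df2).any(axis=1).sum())

print(number_of_diff_rows(df1, df2))
```

On the sample frames only the last row differs (3 versus 8 in column a).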
Upvotes: 1