Krasig

Reputation: 73

Getting number of differences between 2 dataframes in python

I want to compare two DataFrames (same number of rows and columns in both) in Python and get the number of differences. What would be the best way to do this?

def numberOfDifferencess(df1, df2):
    if df1.equals(df2):
        numberOfDifferences = 0
    else:
        ?????

Upvotes: 4

Views: 1681

Answers (3)

Andrew L

Reputation: 7038

Here's one way:

df
    a    b
0   1  999
1   2    3
2   3  345
3  56    8
4   7   54
df_b
    a    b
0   1  111
1   2    3
2   3  345
3  56    8
4   7   54

Comparing:

df.count().sum() - (df == df_b).astype(int).sum().sum()
1 #this is the number of differences

In a function:

def numberOfDifferencess(df1, df2):
    return df1.count().sum() - (df1 == df2).astype(int).sum().sum()

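Essentially, (df == df_b).astype(int).sum().sum() counts the cells that match between the two DataFrames (a field in one equals the corresponding field in the other). Subtracting that from the total number of non-null cells, df.count().sum(), leaves the number of differences.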
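One caveat with this arithmetic: df.count() skips missing values and NaN == NaN evaluates to False, so frames that contain NaN may be counted inconsistently. Below is a minimal sketch of a NaN-aware count, assuming both frames share the same index and columns. The name count_cell_differences is a hypothetical helper, not part of the original answer.

```python
import numpy as np
import pandas as pd


def count_cell_differences(df1, df2):
    """Count differing cells, treating NaN in the same position as equal."""
    # Element-wise inequality; note that NaN != NaN is True here.
    differs = df1.ne(df2)
    # Positions where both frames are missing should not count as a difference.
    both_missing = df1.isna() & df2.isna()
    return int((differs & ~both_missing).to_numpy().sum())


# Same data as the example frames above.
df = pd.DataFrame({"a": [1, 2, 3, 56, 7], "b": [999, 3, 345, 8, 54]})
df_b = pd.DataFrame({"a": [1, 2, 3, 56, 7], "b": [111, 3, 345, 8, 54]})

print(count_cell_differences(df, df_b))  # 1
```

Counting the mismatches directly also avoids the subtraction step, which may be easier to read.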

Quick Speed Test

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 100, size = (1000,1000)))
df2 = pd.DataFrame(np.random.randint(0, 100, size = (1000,1000)))

%timeit numberOfDifferencess(df1, df2)
%timeit number_of_diff(df1, df2) # using spies006 function for comparison (see below)

10 loops, best of 3: 20.6 ms per loop
1 loop, best of 3: 428 ms per loop

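Not surprisingly, the vectorized version is roughly 20 times faster here (20.6 ms vs. 428 ms). Iterating over a DataFrame row by row is generally not the most efficient approach. Note that the two functions do not measure quite the same thing: numberOfDifferencess counts differing cells, while number_of_diff counts rows that contain at least one difference.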
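The %timeit magic only works in IPython or Jupyter. As a rough sketch, a comparable measurement in a plain script could use the standard timeit module. The function names, the smaller 200x200 frames, and the repeat count are assumptions chosen to keep the run short, and absolute timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd


def count_cells_vectorized(df1, df2):
    # Vectorized: compare every cell at once and count the mismatches.
    return int(df1.ne(df2).to_numpy().sum())


def count_rows_loop(df1, df2):
    # Loop: compare one row at a time and count the rows that differ.
    differences = 0
    for i in range(len(df1)):
        if not df1.iloc[i].equals(df2.iloc[i]):
            differences += 1
    return differences


rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.integers(0, 100, size=(200, 200)))
df2 = pd.DataFrame(rng.integers(0, 100, size=(200, 200)))

# Total time for 5 calls of each function.
t_vec = timeit.timeit(lambda: count_cells_vectorized(df1, df2), number=5)
t_loop = timeit.timeit(lambda: count_rows_loop(df1, df2), number=5)
print(f"vectorized: {t_vec:.4f} s, loop: {t_loop:.4f} s")
```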

Upvotes: 3

seorc

Reputation: 436

You could use the underlying NumPy arrays (ndarrays) for this:

from pandas import DataFrame

df = DataFrame(data=[
    [1, 2, 3, 4],
    [6, 7, 8, 4],
    [1, 2, 3, 2]])

dfd = DataFrame(data=[
    [1, 2, 1, 4],
    [6, 9, 8, 4],
    [1, 1, 3, 2]])

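# Element-wise comparison of the underlying arrays gives a boolean array
diff = df.values != dfd.values

# True counts as 1, so the sum is the number of differing cells (3 here)
result = diff.sum()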
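The comparison above can be wrapped in a small function. This is only a sketch: count_array_differences is a hypothetical name, and the explicit shape check is an added assumption, since comparing the raw arrays works by position and ignores index and column labels.

```python
import numpy as np
import pandas as pd


def count_array_differences(df1, df2):
    """Count differing cells by position, ignoring index and column labels."""
    a, b = df1.to_numpy(), df2.to_numpy()
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    # a != b is a boolean array; count_nonzero counts the True entries.
    return int(np.count_nonzero(a != b))


df = pd.DataFrame(data=[
    [1, 2, 3, 4],
    [6, 7, 8, 4],
    [1, 2, 3, 2]])

dfd = pd.DataFrame(data=[
    [1, 2, 1, 4],
    [6, 9, 8, 4],
    [1, 1, 3, 2]])

print(count_array_differences(df, dfd))  # 3
```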

Upvotes: 0

spies006

Reputation: 2927

>>> df1
   a  b
0  1  1
1  2  2
2  3  4


>>> df2
   a  b
0  1  1
1  2  2
2  8  4

Here is one way to do it, building on what you already have. I use loc to iterate over the rows of df1 and df2 and compare them one at a time, so this counts the rows that differ rather than the individual cells.

>>> numberOfDifferences = 0
>>> for i in range(len(df1)):
...     if not df1.loc[i, :].equals(df2.loc[i, :]):
...         numberOfDifferences += 1
... 
>>> numberOfDifferences
1

If you'd like it as a function, as the question implies, here it is.

def number_of_diff(df1, df2):
    differences = 0
    for i in range(len(df1)):
        if not df1.loc[i, :].equals(df2.loc[i, :]):
            differences += 1
    return differences
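For comparison, the same row-level count can be sketched without an explicit loop. This assumes both frames share the same index and columns and contain no NaN values (with NaN present, this version would treat matching NaN cells as different, unlike Series.equals). The name count_differing_rows is hypothetical.

```python
import pandas as pd


def count_differing_rows(df1, df2):
    """Count rows that contain at least one differing cell."""
    # ne() marks differing cells; any(axis=1) flags rows with at least one.
    return int(df1.ne(df2).any(axis=1).sum())


# Same data as df1 and df2 shown above.
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 4]})
df2 = pd.DataFrame({"a": [1, 2, 8], "b": [1, 2, 4]})

print(count_differing_rows(df1, df2))  # 1
```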

Upvotes: 1
