S. AG

Reputation: 15

drop_duplicates doesn't work on multiple instances of identical rows

I'm trying to concatenate two Pandas DataFrames and then drop the duplicates, but for some reason drop_duplicates doesn't work for most of the identical rows (only a few of them are dropped). For instance, these two rows are identical (at least in my eyes) but they still show up: (screenshot of the identical rows)

This is the code I have tried. The result varies with or without the subset argument, but it still doesn't give me the result I want: it tends to over-delete or under-delete when I play with the arguments (for instance, adding or removing columns from the subset).

bigdata = pd.concat([df_q, df_q_temp]).drop_duplicates(subset=['Date', 'Value'], keep='first').reset_index(drop=True)

Can anyone point me in the right direction?

Thanks

Upvotes: 1

Views: 625

Answers (2)

Maryam Bahrami

Reputation: 1104

Take care of string columns with no values. Make sure that cells without a value are read as None. Especially in object-typed columns, there may be whitespace strings that are different from None even though there is effectively no value there.

For example:

import pandas as pd

df = pd.DataFrame({'Col_1': ['one', ' ', 'two', None],
                   'Col_2': [1, 2, 3, 2],
                   'Col_3': ['one', None, 'two', ' ']})
df

    Col_1   Col_2   Col_3
0   one      1      one
1            2      None
2   two      3      two
3   None     2      

As you can see, rows 1 and 3 have no value in Col_1 and Col_3. But since two of those cells are None and the other two are spaces, the rows compare as different.

I had the same problem and struggled a lot with it until I found this. I solved it by replacing None values with spaces:

df = df.fillna(' ')
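
As a quick, self-contained sketch with the toy DataFrame above (nothing beyond what's already shown): after fillna(' '), rows 1 and 3 become identical and drop_duplicates finally removes one of them:

import pandas as pd

df = pd.DataFrame({'Col_1': ['one', ' ', 'two', None],
                   'Col_2': [1, 2, 3, 2],
                   'Col_3': ['one', None, 'two', ' ']})

# After fillna(' '), rows 1 and 3 both become (' ', 2, ' '),
# so drop_duplicates can finally see them as duplicates.
deduped = df.fillna(' ').drop_duplicates().reset_index(drop=True)
print(deduped)  # the second "empty" row is gone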

Upvotes: 1

Pierre D

Reputation: 26221

Expanding on my comment, here is a way to make the differences explicit and normalize your df to drop near-duplicates:

Part 1: show differences

def eq_nan(a, b):
    return (a == b) | ((a != a) & (b != b))  # treat NaN as equal
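
A quick sanity check with made-up values (just to illustrate that the helper treats NaN as equal to NaN):

import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 'x'])
b = pd.Series([1.0, np.nan, 'y'])
print(eq_nan(a, b).tolist())  # [True, True, False] -- the NaNs compare equal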

Let's try with some data:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['foo\u00a0bar', 1, np.nan, None, 4.00000000001, pd.Timestamp('2021-01-01')],
    ['foo bar', 1, np.nan, None, 4, '2021-01-01 00:00:00']],
    columns=list('uvwxyz'),
)
df.loc[1, 'z'] = str(df.loc[1, 'z'])  # the constructor above parsed the second date (a str) into a Timestamp
>>> df.dtypes
u     object
v      int64
w    float64
x     object
y    float64
z     object
dtype: object
>>> df.drop_duplicates()
         u  v   w     x    y                    z
0  foo bar  1 NaN  None  4.0  2021-01-01 00:00:00
1  foo bar  1 NaN  None  4.0  2021-01-01 00:00:00

Find which elements of those two rows differ:

a = df.loc[0]
b = df.loc[1]
diff = ~eq_nan(a, b)
for (col, x), y in zip(a[diff].items(), b[diff]):  # .iteritems() was removed in pandas 2.0
    print(f'{col}:\t{x!r} != {y!r}')

# output:
u:  'foo\xa0bar' != 'foo bar'
y:  4.00000000001 != 4.0
z:  Timestamp('2021-01-01 00:00:00') != '2021-01-01 00:00:00'

Side note: alternatively, if you have cells containing complex types (e.g. lists or dicts), you can use pytest's internals (outside of testing) to get a nicely verbose explanation of exactly how the values differ:

from _pytest.assertion.util import _compare_eq_verbose

for (col, x), y in zip(a[diff].items(), b[diff]):
    da, db = _compare_eq_verbose(x, y)
    print(f'{col}:\t{da} != {db}')

# Output:
u:  +'foo\xa0bar' != -'foo bar'
y:  +4.00000000001 != -4.0
z:  +Timestamp('2021-01-01 00:00:00') != -'2021-01-01 00:00:00'

Part 2: example of normalization to help drop duplicates

We use Pandas' own Series formatter to convert each row into a string representation:

def normalize_row(r):
    vals = r.to_string(header=False, index=False, name=False).splitlines()
    vals = [
        ' '.join(s.strip().split())  # transform any whitespace (e.g. unicode non-breaking space) into ' '
        for s in vals
    ]
    return vals

Example for the first row above:

>>> normalize_row(df.iloc[0])
['foo bar', '1', 'NaN', 'NaN', '4.0', '2021-01-01 00:00:00']
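
And the second row normalizes to the same list, which is why the two rows collapse into one below:

>>> normalize_row(df.iloc[1])
['foo bar', '1', 'NaN', 'NaN', '4.0', '2021-01-01 00:00:00']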

Usage to drop visually identical duplicates:

newdf = df.loc[df.apply(normalize_row, axis=1).map(tuple).drop_duplicates().index]  # axis=1 applies per row; tuples are hashable, lists are not
>>> newdf
         u  v   w     x    y                    z
0  foo bar  1 NaN  None  4.0  2021-01-01 00:00:00
>>> newdf.dtypes
u     object
v      int64
w    float64
x     object
y    float64
z     object
dtype: object

Note: the rows that make it through this filter are copied exactly into newdf (not the string lists that were used for near-duplicate detection).
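
If you'd rather avoid the tuple-valued intermediate Series, an equivalent sketch (same df and normalize_row as above; the '\x1f' delimiter is an arbitrary choice) joins each normalized row into a single string key:

# '\x1f' (ASCII unit separator) is unlikely to occur in real data
key = df.apply(lambda r: '\x1f'.join(normalize_row(r)), axis=1)
newdf = df.loc[~key.duplicated(keep='first')]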

Upvotes: 2
