S. AG

Reputation: 15

drop_duplicates doesn't work on multiple instances of identical rows

I'm trying to concatenate two Pandas DataFrames and then drop the duplicates, but for some reason drop_duplicates doesn't work for most of the identical rows (only a few of them are dropped). For instance, these two rows are identical (at least in my eyes) but they still show up: (screenshot of the identical rows)

This is the code I have tried. The result varies with or without the subset argument, but it still doesn't give me the result I want: it tends to over-delete or under-delete when I play with the arguments (for instance, adding or removing columns from the subset).

bigdata = pd.concat([df_q, df_q_temp]).drop_duplicates(subset=['Date', 'Value'], keep='first').reset_index(drop=True)

Can anyone point me in the right direction?

Thanks

Upvotes: 1

Views: 625

Answers (2)

Maryam Bahrami

Reputation: 1104

Take care of string columns with no values. Make sure that cells without a value are read as None. Especially in object-typed columns, there may be whitespace strings that are different from None even though there is effectively no value there.

For example:

import pandas as pd

df = pd.DataFrame({'Col_1': ['one', ' ', 'two', None],
                   'Col_2': [1, 2, 3, 2],
                   'Col_3': ['one', None, 'two', ' ']})
df

    Col_1   Col_2   Col_3
0   one      1      one
1            2      None
2   two      3      two
3   None     2      

As you can see, rows 1 and 3 have no value in Col_1 and Col_3. But since two of those cells are None and the other two are spaces, the rows compare as different.

I had the same problem and struggled a lot with it until I found this. I solved it by replacing None values with spaces:

df = df.fillna(' ')
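
As a quick, self-contained sketch with the toy DataFrame above (nothing beyond what's already shown): after fillna(' '), rows 1 and 3 become identical and drop_duplicates finally removes one of them:

import pandas as pd

df = pd.DataFrame({'Col_1': ['one', ' ', 'two', None],
                   'Col_2': [1, 2, 3, 2],
                   'Col_3': ['one', None, 'two', ' ']})

# After fillna(' '), rows 1 and 3 both become (' ', 2, ' '),
# so drop_duplicates can finally see them as duplicates.
deduped = df.fillna(' ').drop_duplicates().reset_index(drop=True)
print(deduped)  # the second "empty" row is gone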

Upvotes: 1

Pierre D

Reputation: 26221

Expanding on my comment, here is a way to make the differences explicit and normalize your df to drop near-duplicates:

Part 1: show differences

def eq_nan(a, b):
    return (a == b) | ((a != a) & (b != b))  # treat NaN as equal
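
A quick sanity check with made-up values (just to illustrate that the helper treats NaN as equal to NaN):

import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 'x'])
b = pd.Series([1.0, np.nan, 'y'])
print(eq_nan(a, b).tolist())  # [True, True, False] -- the NaNs compare equal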

Let's try with some data:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['foo\u00a0bar', 1, np.nan, None, 4.00000000001, pd.Timestamp('2021-01-01')],
    ['foo bar', 1, np.nan, None, 4, '2021-01-01 00:00:00']],
    columns=list('uvwxyz'),
)
df.loc[1, 'z'] = str(df.loc[1, 'z'])  # the constructor above parsed the second date (a str) into a Timestamp
>>> df.dtypes
u     object
v      int64
w    float64
x     object
y    float64
z     object
dtype: object
>>> df.drop_duplicates()
         u  v   w     x    y                    z
0  foo bar  1 NaN  None  4.0  2021-01-01 00:00:00
1  foo bar  1 NaN  None  4.0  2021-01-01 00:00:00

Find which elements of those two rows differ:

a = df.loc[0]
b = df.loc[1]
diff = ~eq_nan(a, b)
for (col, x), y in zip(a[diff].items(), b[diff]):  # .iteritems() was removed in pandas 2.0
    print(f'{col}:\t{x!r} != {y!r}')

# output:
u:  'foo\xa0bar' != 'foo bar'
y:  4.00000000001 != 4.0
z:  Timestamp('2021-01-01 00:00:00') != '2021-01-01 00:00:00'

Side note: alternatively, if you have cells containing complex types (e.g. lists or dicts), you can use pytest's internals (outside of testing) to get a nicely verbose explanation of exactly how the values differ:

from _pytest.assertion.util import _compare_eq_verbose

for (col, x), y in zip(a[diff].items(), b[diff]):
    da, db = _compare_eq_verbose(x, y)
    print(f'{col}:\t{da} != {db}')

# Output:
u:  +'foo\xa0bar' != -'foo bar'
y:  +4.00000000001 != -4.0
z:  +Timestamp('2021-01-01 00:00:00') != -'2021-01-01 00:00:00'

Part 2: example of normalization to help drop duplicates

We use Pandas' own Series formatter to convert each row into a string representation:

def normalize_row(r):
    vals = r.to_string(header=False, index=False, name=False).splitlines()
    vals = [
        ' '.join(s.strip().split())  # transform any whitespace (e.g. unicode non-breaking space) into ' '
        for s in vals
    ]
    return vals

Example for the first row above:

>>> normalize_row(df.iloc[0])
['foo bar', '1', 'NaN', 'NaN', '4.0', '2021-01-01 00:00:00']
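
And the second row normalizes to the same list, which is why the two rows collapse into one below:

>>> normalize_row(df.iloc[1])
['foo bar', '1', 'NaN', 'NaN', '4.0', '2021-01-01 00:00:00']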

Usage to drop visually identical duplicates:

newdf = df.loc[df.apply(normalize_row, axis=1).map(tuple).drop_duplicates().index]  # axis=1 applies per row; tuples are hashable, lists are not
>>> newdf
         u  v   w     x    y                    z
0  foo bar  1 NaN  None  4.0  2021-01-01 00:00:00
>>> newdf.dtypes
u     object
v      int64
w    float64
x     object
y    float64
z     object
dtype: object

Note: the rows that make it through this filter are copied exactly into newdf (not the string lists that were used for near-duplicate detection).
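
If you'd rather avoid the tuple-valued intermediate Series, an equivalent sketch (same df and normalize_row as above; the '\x1f' delimiter is an arbitrary choice) joins each normalized row into a single string key:

# '\x1f' (ASCII unit separator) is unlikely to occur in real data
key = df.apply(lambda r: '\x1f'.join(normalize_row(r)), axis=1)
newdf = df.loc[~key.duplicated(keep='first')]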

Upvotes: 2
