Reputation: 15
I'm trying to concatenate two Pandas DataFrames and then drop the duplicates, but for some reason drop_duplicates doesn't work for most of the identical rows (only a few of them get dropped). For instance, these two rows are identical (at least to my eyes) but they still show up: Identical rows here
This is the code I have tried. The result varies with or without the subset argument, but it still doesn't give me the result I want; it tends to over-delete or under-delete when I play with the arguments (for instance, adding or removing columns):
bigdata = pd.concat([df_q, df_q_temp]).drop_duplicates(subset=['Date', 'Value'], keep='first').reset_index(drop=True)
Can anyone point me in the right direction?
Thanks
Upvotes: 1
Views: 625
Reputation: 1104
Take care of string columns with no values. Make sure that cells without a value are actually read as None in the DataFrame. In particular, in object-typed columns there may be cells that contain only spaces; they look empty, but they are different from None.
For example:
import pandas as pd
df = pd.DataFrame({'Col_1': ['one', ' ', 'two', None],
                   'Col_2': [1, 2, 3, 2],
                   'Col_3': ['one', None, 'two', ' ']})
df
  Col_1  Col_2 Col_3
0   one      1   one
1            2  None
2   two      3   two
3  None      2
As you see, rows 1 and 3 have no value in Col_1 and Col_3. But since two of those cells are None and the other two are spaces, the rows are considered different.
I had the same problem and struggled a lot until I found this. I solved it by replacing the None values with spaces:
df = df.fillna(' ')
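The same idea also works in the other direction. A minimal sketch (assuming the blank-looking cells are whitespace-only strings): normalize them to NaN first, then drop the duplicates:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col_1': ['one', ' ', 'two', None],
                   'Col_2': [1, 2, 3, 2],
                   'Col_3': ['one', None, 'two', ' ']})

# turn whitespace-only strings into NaN so every "empty" cell holds the same value
cleaned = df.replace(r'^\s*$', np.nan, regex=True)
cleaned = cleaned.drop_duplicates()  # rows 1 and 3 now collapse into one
Either way, the point is to make the empty-looking cells hold one and the same value before calling drop_duplicates.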
Upvotes: 1
Reputation: 26221
Expanding on my comment, here is a way to make the differences explicit and normalize your df
to drop near-duplicates:
def eq_nan(a, b):
    return (a == b) | ((a != a) & (b != b))  # treat NaN as equal
Let's try with some data:
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['foo\u00a0bar', 1, np.nan, None, 4.00000000001, pd.Timestamp('2021-01-01')],
    ['foo bar', 1, np.nan, None, 4, '2021-01-01 00:00:00']],
    columns=list('uvwxyz'),
)
df.loc[1, 'z'] = str(df.loc[1, 'z'])  # the constructor above converted the second date (a str) to a Timestamp, so force it back to str
>>> df.dtypes
u object
v int64
w float64
x object
y float64
z object
dtype: object
>>> df.drop_duplicates()
u v w x y z
0 foo bar 1 NaN None 4.0 2021-01-01 00:00:00
1 foo bar 1 NaN None 4.0 2021-01-01 00:00:00
Find what elements among those two rows are different:
a = df.loc[0]
b = df.loc[1]
diff = ~eq_nan(a, b)
for (col, x), y in zip(a[diff].items(), b[diff]):
    print(f'{col}:\t{x!r} != {y!r}')
# output:
u: 'foo\xa0bar' != 'foo bar'
y: 4.00000000001 != 4.0
z: Timestamp('2021-01-01 00:00:00') != '2021-01-01 00:00:00'
Side note: alternatively, if you have cells containing complex types, e.g. list, dict, etc., you may use pytest (outside of testing) to get some nice verbose explanation of exactly how the values differ:
from _pytest.assertion.util import _compare_eq_verbose

for (col, x), y in zip(a[diff].items(), b[diff]):
    da, db = _compare_eq_verbose(x, y)
    print(f'{col}:\t{da} != {db}')
# Output:
u: +'foo\xa0bar' != -'foo bar'
y: +4.00000000001 != -4.0
z: +Timestamp('2021-01-01 00:00:00') != -'2021-01-01 00:00:00'
We use Pandas' own Series formatter to convert each row into a string representation:
def normalize_row(r):
    vals = r.to_string(header=False, index=False, name=False).splitlines()
    vals = [
        ' '.join(s.strip().split())  # transform any whitespace (e.g. unicode non-breaking space) into ' '
        for s in vals
    ]
    return vals
Example for the first row above:
>>> normalize_row(df.iloc[0])
['foo bar', '1', 'NaN', 'NaN', '4.0', '2021-01-01 00:00:00']
Usage to drop visually identical duplicates:
# apply(normalize_row, axis=0) normalizes column by column and returns a DataFrame of
# strings with the original index, so drop_duplicates() compares the rows exactly as displayed
newdf = df.loc[df.apply(normalize_row, axis=0).drop_duplicates().index]
>>> newdf
u v w x y z
0 foo bar 1 NaN None 4.0 2021-01-01 00:00:00
>>> newdf.dtypes
u object
v int64
w float64
x object
y float64
z object
dtype: object
Note: the rows that make it through this filter are copied into newdf exactly as they were in df (not the normalized strings that were used for near-duplicate detection).
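For reference, applied back to the question's setup this would look roughly like the sketch below (df_q and df_q_temp are the names from the question; ignore_index=True keeps the index labels unique so that the final .loc picks the intended rows):
bigdata = pd.concat([df_q, df_q_temp], ignore_index=True)
keep = bigdata.apply(normalize_row, axis=0).drop_duplicates().index
bigdata = bigdata.loc[keep].reset_index(drop=True)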
Upvotes: 2