Reputation: 155
I have a .csv file with 2741 rows and 279 columns.
When I try to load that file in Python using pd.read_csv(),
this is what I get:
>>> df = pd.read_csv("preprocessed_data.csv")
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2882: DtypeWarning: Columns (1,2,3) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
>>> df.shape
(18696, 279)
Clearly the number of rows has gone from 2741 to 18696, which is absurd.
So I checked for duplicate rows as follows:
>>> df[df.duplicated()].shape
(15987, 279)
This means that out of those 18696 rows, 15987 are duplicates. Why do these duplicates appear after loading the CSV file, and how can I resolve this?
Upvotes: 0
Views: 943
Reputation: 142631
As for me, the problem is most likely in how you create this file, not in how you load it.
Maybe you used .to_csv()
many times with mode append,
and it added the same rows many times.
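A minimal sketch of how that can happen, using a hypothetical small DataFrame and temp file (not your actual data): writing once normally and then appending the same frame again doubles the row count on reload.

```python
import os
import tempfile

import pandas as pd

# Hypothetical 2-row DataFrame standing in for your real data.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

path = os.path.join(tempfile.mkdtemp(), "data.csv")
df.to_csv(path, index=False)                           # first write: header + 2 rows
df.to_csv(path, mode="a", index=False, header=False)   # append: same 2 rows again

print(pd.read_csv(path).shape)  # (4, 2) - twice the original row count
```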
Until you fix the writing step, you can use ~
in df[ ~df.duplicated() ]
to keep only the unique rows:
df = df[ ~df.duplicated() ]
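A small sketch on a made-up DataFrame (not your actual columns): df.duplicated() flags every repeat of an earlier row, so negating it with ~ keeps only the first occurrences. drop_duplicates() is the equivalent built-in.

```python
import pandas as pd

# Hypothetical DataFrame where each row appears twice.
df = pd.DataFrame({"a": [1, 2, 1, 2], "b": ["x", "y", "x", "y"]})

unique_df = df[~df.duplicated()]  # keep the first occurrence of each row
print(unique_df.shape)            # (2, 2)

# drop_duplicates() does the same thing.
assert unique_df.equals(df.drop_duplicates())
```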
Upvotes: 1