Reputation: 155
I have a .csv file with 2741 rows and 279 columns.
When I try to load that file in Python using pd.read_csv(),
this is what I get:
>>> df = pd.read_csv("preprocessed_data.csv")
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2882: DtypeWarning: Columns (1,2,3) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
>>> df.shape
(18696, 279)
Clearly the number of rows has gone from 2741 to 18696, which is absurd.
So I checked for duplicate rows as follows:
>>> df[df.duplicated()].shape
(15987, 279)
This means that out of those 18696 rows, 15987 are duplicates. Why do these duplicates appear after loading the CSV file, and how can I resolve this?
Upvotes: 0
Views: 943
Reputation: 142631
As for me, the problem is most likely in how you create this file, not in how you load it.
Maybe you used .to_csv()
many times with mode append,
and it added the same rows many times.
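A minimal sketch of how that can happen, using a hypothetical small DataFrame and temp file (not your actual data): writing once normally and then appending the same frame again doubles the row count on reload.

```python
import os
import tempfile

import pandas as pd

# Hypothetical 2-row DataFrame standing in for your real data.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

path = os.path.join(tempfile.mkdtemp(), "data.csv")
df.to_csv(path, index=False)                           # first write: header + 2 rows
df.to_csv(path, mode="a", index=False, header=False)   # append: same 2 rows again

print(pd.read_csv(path).shape)  # (4, 2) - twice the original row count
```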
Until you fix the writing step, you can use ~
in df[ ~df.duplicated() ]
to keep only the unique rows:
df = df[ ~df.duplicated() ]
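A small sketch on a made-up DataFrame (not your actual columns): df.duplicated() flags every repeat of an earlier row, so negating it with ~ keeps only the first occurrences. drop_duplicates() is the equivalent built-in.

```python
import pandas as pd

# Hypothetical DataFrame where each row appears twice.
df = pd.DataFrame({"a": [1, 2, 1, 2], "b": ["x", "y", "x", "y"]})

unique_df = df[~df.duplicated()]  # keep the first occurrence of each row
print(unique_df.shape)            # (2, 2)

# drop_duplicates() does the same thing.
assert unique_df.equals(df.drop_duplicates())
```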
Upvotes: 1