Sitz Blogz

Reputation: 1061

Remove Duplicate rows from csv [headers + Content]

I have a data set that is more than 100 MB in size, spread across many files. These files have more than 20 columns and more than 1 million rows.

The main problem with data is:

  1. Headers are repeating -- duplicate header rows
  2. Rows duplicated in full, i.e. the data in every column of that particular row is duplicated.

Without bothering about which columns or how many columns there are, I only need to keep the first occurrence and remove the rest.

I found many examples, but what I am looking for is for the input and output to be the same file. The only reason I'm seeking help is that I want the same file to be edited in place.
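One way to meet the in-place requirement is to read the file, clean it in memory, and write it back to the same path. A minimal sketch, using a tiny stand-in file with hypothetical data (the path and contents are placeholders, not the questioner's real file):

```python
import pandas as pd

# Stand-in for the real file: a tiny CSV with a repeated header row
# and a fully duplicated data row (hypothetical data).
path = "sample_duplicate.csv"
with open(path, "w") as f:
    f.write("a,b,c\n1,2,3\na,b,c\n1,2,3\n4,5,6\n")

df = pd.read_csv(path)

# Repeated header rows show up as data rows whose cells equal the column names.
df = df[~(df.astype(str) == list(df.columns)).all(axis=1)]

# Keep the first occurrence of each fully duplicated row, drop the rest.
df = df.drop_duplicates(keep="first")

# Write back to the same path, so input and output are the same file.
df.to_csv(path, index=False)
```

Since the whole frame is held in memory, this works per file; for 100 MB files that is usually fine, and the same snippet can be looped over all files.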

Sample input: https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=0

I appreciate the help; thanks in advance.

Upvotes: 1

Views: 2740

Answers (1)

TomTom101

Reputation: 6892

If the number of duplicate headers is known and constant, skip those rows:

import pandas as pd

csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1', skiprows=4)

Alternatively, with the bonus of removing all duplicates based on all columns, do this:

csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1')
csv = csv.drop_duplicates()

Now you still have one leftover header line in the data; just skip it:

csv = csv.iloc[1:]

You can then overwrite the input file with pandas.DataFrame.to_csv.
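Put together, the steps above look like this on a tiny stand-in file (hypothetical data in place of the Dropbox file; this assumes, as in the sample, that the one surviving duplicate header ends up as the first data row):

```python
import pandas as pd

# Placeholder path standing in for the questioner's file.
path = "sample_duplicate.csv"
with open(path, "w") as f:
    f.write("a,b\na,b\n1,2\n1,2\n3,4\n")

csv = pd.read_csv(path)
csv = csv.drop_duplicates()    # identical rows collapse to one occurrence each
csv = csv.iloc[1:]             # the remaining header row is first; skip it
csv.to_csv(path, index=False)  # overwrite the input file in place
```

Passing index=False keeps to_csv from writing the row index as an extra column.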

Upvotes: 2
