Reputation: 1061
I have a data set that is more than 100 MB in size and spread across many files. These files have more than 20 columns and more than 1 million rows.
The main problem with the data is:
Regardless of which columns or how many columns there are, I only need to keep the first occurrence of each duplicate row and remove the rest.
I did find many examples, but what I am looking for is for the input and output to be the same file. The only reason I am seeking help is that I want the same file to be edited in place.
Sample input: https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=0
I appreciate the help; thanks in advance.
Upvotes: 1
Views: 2740
Reputation: 6892
If the number of duplicate header rows is known and constant, skip those rows:
import pandas as pd
csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1', skiprows=4)
Alternatively, with the bonus of removing all duplicate rows (compared across all columns), do this:
csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1')
csv = csv.drop_duplicates()
By default, drop_duplicates() keeps the first occurrence of each duplicated row, which matches your requirement.
You still have one leftover header line in the data; just skip it:
csv = csv.iloc[1:]
You can then overwrite the input file with pandas.DataFrame.to_csv.
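For example, here is a minimal sketch of the whole round trip on a local file; the path sample_duplicate.csv is just a placeholder for your actual file:
import pandas as pd

path = 'sample_duplicate.csv'   # placeholder for your input file
csv = pd.read_csv(path)
csv = csv.drop_duplicates()     # keeps the first occurrence of each row
csv = csv.iloc[1:]              # drop the leftover header row inside the data
csv.to_csv(path, index=False)   # write back to the same file, without the pandas index
Note that pandas loads the whole file into memory before writing it back, which is fine for files in the 100 MB range.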
Upvotes: 2