Reputation: 39
CSV file:
a,b
a,c
a,d
a,b
a,a
Code widely recommended for removing duplicates:
import fileinput
seen = set()
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen: continue
    seen.add(line)
    print(line)
Result obtained (a blank line appears after each row):
a,b

a,c

a,d

a,a

Expected result:
a,b
a,c
a,d
a,a
How can I avoid creating these blank lines during the process?
Upvotes: 3
Views: 313
Reputation: 5030
This happens because each line you read from the file already ends with \n, and the print() function adds another \n. If you only want to print the distinct lines, you can use print(line, end='').
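To see the doubled newline in isolation, here is a minimal sketch. It assumes an in-memory io.StringIO object standing in for the file, and render is a hypothetical helper used only to capture what print() writes; neither is part of the original code.

```python
import io
from contextlib import redirect_stdout

def render(lines, **print_kwargs):
    """Hypothetical helper: print each line with the given options and return the captured text."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        for line in lines:
            print(line, **print_kwargs)
    return buffer.getvalue()

# Iterating over a file-like object yields lines that keep their trailing "\n".
lines = list(io.StringIO("a,b\na,c\n"))

doubled = render(lines)        # print() appends its own "\n" after each line
clean = render(lines, end='')  # end='' suppresses the extra newline
```

Comparing repr(doubled) with repr(clean) shows the extra "\n" after every row in the first case only.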
However, if you wish to remove duplicate rows, I recommend using pandas, like this:
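import pandas as pd
# header=None: the sample file has no header row, so the first line is data
df = pd.read_csv('1.csv', header=None)
# Removing duplicate rows:
df = df.drop_duplicates()
# Saving after removing duplicates (no index column, no header row)
df.to_csv('1_clean.csv', index=False, header=False)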
Upvotes: 1
Reputation: 5184
The print function adds a newline at the end of each line. To change this behavior, pass the end='' argument, like this:
import fileinput
seen = set()
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen: continue
    seen.add(line)
    print(line, end='')
There are many other ways and libraries you can use to achieve this. This post covers the other methods quite well: https://www.py4u.net/discuss/16763. You can go through them and see which one works best for you.
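As one illustration of another approach, here is a minimal sketch that writes the unique lines to a separate file instead of editing in place. The function name dedupe_lines and the idea of a separate output path are assumptions for the example, not something taken from the linked post. It compares lines without their line endings, so a final row that lacks a trailing newline is still recognized as a duplicate.

```python
def dedupe_lines(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first occurrence of each line."""
    seen = set()
    # newline='' leaves line endings untouched so they can be handled explicitly
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        for line in src:
            key = line.rstrip('\r\n')  # compare rows without their line ending
            if key in seen:
                continue
            seen.add(key)
            dst.write(key + '\n')  # write exactly one newline per kept row
```

For example, dedupe_lines('1.csv', '1_clean.csv') would leave the original file untouched and put the unique rows in 1_clean.csv.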
Upvotes: 2