Resultados Oficiais

Reputation: 39

How to remove duplicate lines without creating empty lines in CSV file?

CSV file:

a,b
a,c
a,d
a,b
a,a

Code widely recommended for removing duplicates:

import fileinput
seen = set()
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen: continue

    seen.add(line)
    print(line)

Result obtained:

a,b

a,c

a,d

a,a

Expected result:

a,b
a,c
a,d
a,a

How can I remove the duplicates without creating these empty lines?

Upvotes: 3

Views: 313

Answers (2)

aminrd

Reputation: 5030

This happens because each line read from the file already ends with \n, and print() appends another \n. If you only want to print the distinct lines without the extra blank lines, use print(line, end='').
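The doubled newline can be seen in isolation with a small sketch. The helper name `printed` is made up for this illustration; it only captures what print() would write so the two cases can be compared.

```python
import io
from contextlib import redirect_stdout

line = "a,b\n"  # a line as read from the file, newline included


def printed(text, **kwargs):
    """Return exactly what print() would write for text."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        print(text, **kwargs)
    return buf.getvalue()


default_output = printed(line)        # print() appends its own "\n"
fixed_output = printed(line, end="")  # end="" suppresses the extra newline

print(repr(default_output))  # 'a,b\n\n'
print(repr(fixed_output))    # 'a,b\n'
```

The second "\n" in the default case is what shows up as an empty line in the rewritten file.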

However, if you want to remove duplicate rows, I recommend using pandas:

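import pandas as pd

# The sample file has no header row, so header=None keeps the first
# line ("a,b") as data instead of treating it as column names
df = pd.read_csv('1.csv', header=None)

# Remove duplicate rows, keeping the first occurrence
df = df.drop_duplicates()

# Save the result without the index or generated column names
df.to_csv('1_clean.csv', index=False, header=False)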

Upvotes: 1

anarchy

Reputation: 5184

The print function appends a newline to whatever it prints, and each line read from the file already ends with one. To change this behavior, pass end='' to print:

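import fileinput

seen = set()  # lines already written back to the file
# inplace=True redirects print() output into 1.csv
for line in fileinput.input('1.csv', inplace=True):
    if line in seen:
        continue  # skip a line that was already written
    seen.add(line)
    # line already ends with '\n', so suppress print's own newline
    print(line, end='')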

There are many other ways and libraries you can use to achieve this. This post covers the other methods quite well: https://www.py4u.net/discuss/16763. You can go through them and see which one works best for you.
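As one example of such an alternative (a sketch of my own, not taken from the linked post), the standard library alone is enough: dict.fromkeys keeps the first occurrence of each key in insertion order. The function name `dedupe_lines` and the temporary demo files are assumptions made for illustration.

```python
import os
import tempfile


def dedupe_lines(src_path, dst_path):
    """Copy src_path to dst_path, keeping the first occurrence of each line."""
    with open(src_path, newline="") as src:
        # splitlines() drops line endings, so a final line without a
        # trailing newline still matches its earlier copies
        lines = src.read().splitlines()
    unique = list(dict.fromkeys(lines))  # order-preserving de-duplication
    with open(dst_path, "w", newline="") as dst:
        dst.writelines(f"{line}\n" for line in unique)
    return unique


# Demo on a temporary copy of the sample data from the question
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "1.csv")
    dst = os.path.join(tmp, "1_clean.csv")
    with open(src, "w") as f:
        f.write("a,b\na,c\na,d\na,b\na,a")
    result = dedupe_lines(src, dst)
    with open(dst) as f:
        cleaned = f.read()

print(result)
```

Writing to a separate file rather than editing in place leaves the original untouched if something goes wrong. It does read the whole file into memory, which is fine for small CSV files like the one in the question.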

Upvotes: 2
