Gravity Mass

Reputation: 605

How to remove duplicate rows from CSV?

Data:

    112343  The data point was created on 1903.
    112344  The data point was created on 1909.
    112345  The data point was created on 1919.
    112346  The data point was created on 1911.
    112346  The data point was created on 1911-12.
    112346  The data point was created on 1911-12.
    112347  The data point was created on 1911.
    112348  The data point was created on 1911.

Here the duplicates share the same id (the first column). I want the duplicates removed, keeping only the row with the longest row[1] (as shown in the ideal output below).

Here is what I tried:

import csv

df = csv.reader(open('fil.csv'), delimiter=',')

for r in df:
    dup = next(df)
    if r[0] == dup[0]:
        if r[1] < dup[1]:  # I am checking if the text is larger than the previous
            print dup[0], dup[1]
    else:
        print r[0], r[1]

But the output I get is:

112343  The data point was created on 1903.
112346  The data point was created on 1911-12.
112346  The data point was created on 1911-12.
112346  The data point was created on 1911.
112348  The data point was created on 1911.

Some of the rows are missing!

The ideal output would be

112343  The data point was created on 1903.
112344  The data point was created on 1909.
112345  The data point was created on 1919.
112346  The data point was created on 1911-12.
112347  The data point was created on 1911.
112348  The data point was created on 1911.

How can this be accomplished? What condition or check should I use? Or could I make two copies of the file and compare the rows between them to eliminate the duplicates?

Upvotes: 1

Views: 2849

Answers (7)

Gravity Mass

Reputation: 605

This is how I removed the duplicates.

First, I removed exact duplicates in Excel. But some duplicates remained: rows with the same id but different lengths for row[1]. Of each duplicated pair of rows, I want the one with the larger second column (higher len(row[1])). Here is what I did:

import csv

dfo = open('fil.csv', 'rU')
df = csv.reader(dfo, delimiter=',')

prev_id = ''

for r in reversed(list(df)):
    if r[0] == prev_id:
        continue  # duplicate id: keep only the first row seen in reversed order
    print r[0] + '|' + r[1] + '|' + r[2] + '|' + r[3]
    # I used | for the csv separation.
    prev_id = r[0]

This took care of the duplicates: for each id I skip the duplicate rows with the smaller r[1], since they come later in the reversed list. The loop prints the rows in reverse order, so I saved the output to a CSV file and printed that new file in reverse again, restoring the original order. That solved the problem.
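
For completeness, the re-reversal step can be a tiny script of its own. A minimal sketch, assuming the loop's output above was saved as dedup_reversed.csv (a hypothetical file name):

# restore the original order by reading the intermediate file backwards
# ('dedup_reversed.csv' is a hypothetical name for the saved output above)
with open('dedup_reversed.csv') as f:
    for line in reversed(f.readlines()):
        print line,  # trailing comma: the line already ends with '\n'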

Upvotes: 1

gboffi

Reputation: 25023

I'm working on the (not unreasonable) assumption that your data is always sorted on id.

The initialization

from sys import maxint

prev_id = maxint  # sentinel: the first row can never look like a "new id", so the empty longest is not printed
longest = ""
data = open('myfile.dat')

The loop on data

for row in data:
    row = row.rstrip('\n')  # drop the newline so print does not double-space the output
    curr_id = int(row.split()[0])
    if prev_id < curr_id:
        # a new id: emit the longest row of the group just finished
        print longest
        longest = row
    elif len(row) > len(longest):
        longest = row
    prev_id = curr_id
# here we still have one group left to output
print longest

The relative merit of this answer is its memory efficiency: rows are processed one at a time, so the whole file never has to be held in memory. Of course, correctness depends on the ordering I assumed in the data file!
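
Under the same sorted-on-id assumption, an equivalent streaming variant (my sketch, not part of the answer above) can be written with itertools.groupby, which groups consecutive rows that share a key:

from itertools import groupby

with open('myfile.dat') as data:
    # consecutive rows sharing an id form one group (this is why sorted input matters)
    for key, rows in groupby(data, key=lambda row: row.split()[0]):
        print max(rows, key=len),  # the longest row of the group; it still ends with '\n'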

Upvotes: 1

Hooting

Reputation: 1711

The reason your code skips lines is the call to next: every next(df) inside the loop consumes a row that the for loop then never sees. In my solution, I first read all rows into a list, then sort it so that rows with the same id are adjacent and the one with the longest second column comes first; for each run of equal ids we keep only the first row and skip the others.

import csv

with open('file.csv', 'rb') as f:
    reader = csv.reader(f)
    your_list = list(reader)

# sort by id, then by descending length of the second column,
# so the longest row of each id comes first
your_list.sort(key=lambda row: (row[0], -len(row[1])))

result = [your_list[0]]  # to store the filtered results
for index in range(1, len(your_list)):
    if your_list[index][0] != your_list[index-1][0]:  # a new id begins here
        result.append(your_list[index])
print result

Upvotes: 0

Learner

Reputation: 5292

My solution would be:

import csv

unqkey = set()
data = []

with open(r"C:\data.csv") as csvfile:  # raw string, so the backslash is not treated as an escape
    reader = csv.reader(csvfile)
    for row in reader:
        unqkey.add(row[0])
        data.append(row)

unqkey = sorted(unqkey)

for i in unqkey:
    r = []
    for j in data:
        if j[0] == i:
            r.append(' '.join(j))
    r.sort(key=len)  # the longest row ends up last
    print r[-1]

It prints:

112343  The data point was created on 1903.
112344  The data point was created on 1909.
112345  The data point was created on 1919.
112346  The data point was created on 1911-12.
112347  The data point was created on 1911.
112348  The data point was created on 1911.

Upvotes: 1

Termi

Reputation: 661

Try this:

import csv

some_dict = {}
file_name = "sample.csv"
with open(file_name) as f:
    data = csv.reader(f, delimiter=' ')
    for row in data:
        key = row.pop(0)  # the id column
        if key in some_dict:
            if len(row[0]) > len(some_dict[key]):  # a longer text for this id
                some_dict[key] = row.pop(0)
        else:
            some_dict[key] = row.pop(0)

for key, value in some_dict.iteritems():
    print key, value
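
If you want the result in a file rather than printed, a short follow-up sketch (deduped.csv is a hypothetical output name, reusing some_dict from above):

import csv

# write the de-duplicated rows back out
with open('deduped.csv', 'wb') as out:
    writer = csv.writer(out, delimiter=' ')
    for key, value in some_dict.iteritems():
        writer.writerow([key, value])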

Upvotes: 1

Konstantin

Reputation: 25329

My attempt:

import csv
import collections

csv_input = """    112343,  The data point was created on 1903.
    112344,  The data point was created on 1909.
    112345,  The data point was created on 1919.
    112346,  The data point was created on 1911.
    112346,  The data point was created on 1911-12.
    112346,  The data point was created on 1911-12.
    112347,  The data point was created on 1911.
    112348,  The data point was created on 1911."""

reader = csv.reader(csv_input.split('\n'))

result = collections.OrderedDict()  # remembers the first-seen order of the ids
for row_id, data in reader:
    if len(result.get(row_id, '')) < len(data):  # keep the longest data per id
        result[row_id] = data

for row_id, data in result.items():
    print "{},{}".format(row_id, data)

Upvotes: 1

JSideris

Reputation: 5261

How to remove duplicate rows from CSV?

Open the CSV in Excel. Excel has a built-in Remove Duplicates tool (on the Data tab). Follow this tutorial for more info.

Upvotes: 0
