Reputation: 605
Data:
112343 The data point was created on 1903.
112344 The data point was created on 1909.
112345 The data point was created on 1919.
112346 The data point was created on 1911.
112346 The data point was created on 1911-12.
112346 The data point was created on 1911-12.
112347 The data point was created on 1911.
112348 The data point was created on 1911.
Here the duplicates share the same id (the first column). I want the duplicates removed, keeping the row with the longest row[1] (as shown in the ideal output below).
Here is what I tried:
import csv

df = csv.reader(open('fil.csv', 'rU'), delimiter=',')
for r in df:
    dup = next(df)
    if r[0] == dup[0]:
        if r[1] < dup[1]:  # I am checking if the text is larger than the previous
            print dup[0], dup[1]
    else:
        print r[0], r[1]
But I am getting the output as,
112343 The data point was created on 1903.
112346 The data point was created on 1911-12.
112346 The data point was created on 1911-12.
112346 The data point was created on 1911.
112348 The data point was created on 1911.
Some of the rows are missing!
The ideal output would be
112343 The data point was created on 1903.
112344 The data point was created on 1909.
112345 The data point was created on 1919.
112346 The data point was created on 1911-12.
112347 The data point was created on 1911.
112348 The data point was created on 1911.
How can this be accomplished? What condition or keyword can I use? Or could I make two copies of the file and compare the rows between them to eliminate the duplicates?
Upvotes: 1
Views: 2849
Reputation: 605
This is how I removed the duplicates.
First, I removed exact duplicates through Excel. But some other duplicates remained with different column sizes (same id but different lengths for row[1]). From each duplicated pair of rows, I want to keep the row with the larger second column (higher len(row[1])). Here is what I did:
import csv

dfo = open('fil.csv', 'rU')
df = csv.reader(dfo, delimiter=',')
temp = ''
temp1 = ''
for r in reversed(list(df)):
    if r[0] == temp:
        continue  # skip the later duplicate, which has the shorter r[1] in my data
    elif len(r[1]) > len(temp1):
        print r[0] + '|' + r[1] + '|' + r[2] + '|' + r[3]  # I used | for the csv separation.
    else:
        print r[0] + '|' + r[1] + '|' + r[2] + '|' + r[3]
    temp = r[0]
    temp1 = r[1]
This took care of the duplicates: here I basically skip the duplicate rows with the smaller r[1]. Because I iterated over the reversed list, the output came out in reverse order, so I saved it to a csv file and then printed this new file in reverse again (restoring the original order). That solved the problem.
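For completeness, a minimal sketch of that second reversal pass, assuming the intermediate output was saved as fil_rev.csv (the filename is my own placeholder):

# Print the intermediate file's lines in reverse,
# restoring the original row order of fil.csv.
with open('fil_rev.csv', 'rU') as f:
    for line in reversed(f.readlines()):
        print line.rstrip('\n')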
Upvotes: 1
Reputation: 25023
I'm working on the (not unreasonable) assumption that your data is always sorted on id.
The initialization:
from sys import maxint

prev_id = maxint  # sentinel larger than any real id, so nothing is printed for the first row
longest = ""
data = open('myfile.dat')
The loop on data:
for row in data:
    curr_id = int(row.split()[0])
    if prev_id < curr_id:
        # a new id begins: output the longest row of the previous group
        print longest
        longest = row
    elif len(row) > len(longest):
        longest = row
    prev_id = curr_id

# here we still have one row to output
print longest
The relative merit of this answer is its memory efficiency: rows are processed one at a time, so the whole file never has to be held in memory. Of course, this efficiency depends on the ordering I assumed in the data file!
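If the data is not already sorted on id, it could be sorted first, at the cost of that memory efficiency; a minimal sketch, assuming the same space-separated layout as above:

# Pre-sort the rows on the numeric id so the streaming loop
# above can be applied; this loads the whole file into memory.
rows = open('myfile.dat').readlines()
rows.sort(key=lambda row: int(row.split()[0]))
data = rows  # the loop above works on any iterable of lines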
Upvotes: 1
Reputation: 1711
The reason your code skips lines is the next() function: each call to next(df) consumes a row from the same iterator the for loop is reading, so every pass through the loop advances by two rows.
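A minimal demonstration of that effect (a toy example of my own, not from the question):

items = iter([1, 2, 3, 4, 5, 6])
for x in items:
    y = next(items)  # consumes the following item as well
    print x, y       # prints: 1 2, then 3 4, then 5 6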
In my solution, I first read all the lines into a list, then sort the list by the second column; then, whenever the first-column value is the same as the previous row's, we keep only the first row and skip the others.
import csv
from operator import itemgetter

with open('file.csv', 'rb') as f:
    reader = csv.reader(f)
    your_list = list(reader)

your_list.sort(key=itemgetter(1))  # sorted by the second column; the row to keep sorts first within each id

result = [your_list[0]]  # to store the filtered results
for index in range(1, len(your_list)):
    # keep a row only when its id differs from the previous row's id
    if your_list[index][0] != your_list[index - 1][0]:
        result.append(your_list[index])
print result
Upvotes: 0
Reputation: 5292
My solution would be:
import csv

unqkey = set()
data = []
with open(r"C:\data.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        unqkey.add(row[0])
        data.append(row)

unqkey = sorted(list(unqkey))
for i in unqkey:
    r = []
    for j in data:
        if j[0] == i:
            r.append(' '.join(j))
    r.sort(key=len)  # shortest first
    print r[-1]      # the longest row for this id
It prints:
112343 The data point was created on 1903.
112344 The data point was created on 1909.
112345 The data point was created on 1919.
112346 The data point was created on 1911-12.
112347 The data point was created on 1911.
112348 The data point was created on 1911.
Upvotes: 1
Reputation: 661
Try this:
import csv

some_dict = {}
file_name = "sample.csv"
with open(file_name) as f:
    data = csv.reader(f, delimiter=',')  # the question's file is comma-delimited
    for row in data:
        key = row.pop(0)
        if key in some_dict:
            if len(row[0]) > len(some_dict[key]):
                some_dict[key] = row.pop(0)
        else:
            some_dict[key] = row.pop(0)

# note: a plain dict, so the ids print in arbitrary order
for key, value in some_dict.iteritems():
    print key, value
Upvotes: 1
Reputation: 25329
My attempt:
import csv
import collections

csv_input = """112343, The data point was created on 1903.
112344, The data point was created on 1909.
112345, The data point was created on 1919.
112346, The data point was created on 1911.
112346, The data point was created on 1911-12.
112346, The data point was created on 1911-12.
112347, The data point was created on 1911.
112348, The data point was created on 1911."""

reader = csv.reader(csv_input.split('\n'))
result = collections.OrderedDict()  # preserves the first-seen order of the ids
for row_id, data in reader:
    # keep the longest data string seen so far for each id
    if len(result.get(row_id, '')) < len(data):
        result[row_id] = data

for row_id, data in result.items():
    print "{},{}".format(row_id, data)
Upvotes: 1
Reputation: 5261
How to remove duplicate rows from a CSV?
Open the CSV in Excel. Excel has a built-in tool that allows you to remove duplicates. Follow this tutorial for more info. Note that Remove Duplicates keeps the first occurrence of each key, so to keep the longest row per id you would first sort the sheet descending by the length of the second column (e.g. using a helper LEN() column).
Upvotes: 0