Reputation: 23
I'm receiving a data feed from an external source that may contain duplicate values.
hash, date, address, name
foo, 11/1/2013, 123 blah street, Jason
foo1, 11/15/2013, 15 foo road, Greg
foo, 11/2/2013, 123 blah street, Jason
The objective here is to "Dedup" by the first column (called hash) BUT I want to take the line with the "latest" date. In this case it would be:
foo, 11/2/2013, 123 blah street, Jason
I can't simply use the build in list functions (unless I totally missed one) so I'm trying to think of another way. Maybe iterate through the list somehow and compare the values?
Any help would be MUCH appreciated.
Thanks.
Upvotes: 2
Views: 113
Reputation: 20608
defaultdict
may be of some utility here. A similar class can be constructed that would accept a value if it meets your defined criteria.
from collections import defaultdict
Upvotes: 0
Reputation: 58965
Compiling the answer of @John Zwinck and the comment of @Saša Šijak) you can obtain what you want doing:
import csv
from collections import OrderedDict
hash_col = 0
date_col = 1
with open('test.txt') as f:
od = OrderedDict()
lines = []
for line in csv.reader(f):
line[date_col] = '{0:02d}/{1:02d}/{2:04d}'.format(
*map(int, line[date_col].split('/')))
lines.append(line)
lines.sort(key = lambda x: x[date_col])
for line in lines:
od[line[hash_col]] = line
Upvotes: 0
Reputation: 326
I don't recall a lot of the inbuilt list functions (which will probably be more efficient). There are several ways to solve the problem, one is this:
Method 1: Not as elegant
*Alternative *
It depends on your intended structure and performance. I haven't done any testing on the above two so I can't speak on the speed. but the hash lookup on the dictionary might be faster overall. I stand to be corrected, though
Edit: Comparing dates Take the date string and split it by the delimiter ('\', '-', etc.) and then for the two dates in question, use a simple loop to check the recency, going from year down to day. Example: (pseudo code)
def newestDate(date1, date2):
result = date1;
date1array = date1.split("/");
date2array = date2.split("/");
for i in range(0,date1array.length):
if (int(date2array[i]) > int(date1array[i])):
result = date2;
break;
return result;
Of course, you should build in some error checking in this function to ensure that nothing crazy happens. Also, convert each string being compared into an integer
Upvotes: 0
Reputation: 249444
Make a dict, with the key being your "hash" column and the values being the entire lines. For each line, see if it is in the dict; if not, insert it. If it is there, overwrite the existing value if the date is later.
If you need the output to be ordered similarly to the input, use an OrderedDict instead of a regular one.
Upvotes: 4