Reputation: 23

Python: Dedup a list based on certain critera

I'm receiving a data feed from an external source that may contain duplicate values.

hash, date, address, name

foo, 11/1/2013, 123 blah street, Jason
foo1, 11/15/2013, 15 foo road, Greg
foo, 11/2/2013, 123 blah street, Jason

The objective here is to "Dedup" by the first column (called hash) BUT I want to take the line with the "latest" date. In this case it would be:

foo, 11/2/2013, 123 blah street, Jason

I can't simply use the build in list functions (unless I totally missed one) so I'm trying to think of another way. Maybe iterate through the list somehow and compare the values?

Any help would be MUCH appreciated.

Thanks.

Upvotes: 2

Answers (4)

kzh

Reputation: 20608

defaultdict may be of some utility here. A similar class can be constructed that would accept a value if it meets your defined criteria.

from collections import defaultdict

Upvotes: 0

Saullo G. P. Castro

Reputation: 58965

Compiling the answer of @John Zwinck and the comment of @Saša Šijak) you can obtain what you want doing:

import csv
from collections import OrderedDict

hash_col = 0
date_col = 1  

with open('test.txt') as f:
    od = OrderedDict()
    lines = []
    for line in csv.reader(f):
        line[date_col] = '{0:02d}/{1:02d}/{2:04d}'.format(
                             *map(int, line[date_col].split('/')))
        lines.append(line)
    lines.sort(key = lambda x: x[date_col])
    for line in lines:
        od[line[hash_col]] = line

Upvotes: 0

Daniel Kotin

Reputation: 326

I don't recall a lot of the inbuilt list functions (which will probably be more efficient). There are several ways to solve the problem, one is this:

Method 1: Not as elegant

Build a structure that holds each line that comes in (eg, a list). Since you'll be searching by column you could build a List of Lists where each element (accessed by a row, column co-ordinate) represents a word (using comma-separation).
Loop through the first column and on the structure (so, 'hash' column)
When you get a match, step into the next column (in this case, date) and check for duplicates there
If one is found, check the date string and select the one with the newer date.

*Alternative *

Build a dictionary with the key being the hash and the value being a list of the other words [date, address, name].
Wrap this dictionary in your own which, on insertion of a new key-value-pair, check the 'date' member and if the incoming date is newer, replace the old one. Else, ignore the new line.

It depends on your intended structure and performance. I haven't done any testing on the above two so I can't speak on the speed. but the hash lookup on the dictionary might be faster overall. I stand to be corrected, though

Edit: Comparing dates Take the date string and split it by the delimiter ('\', '-', etc.) and then for the two dates in question, use a simple loop to check the recency, going from year down to day. Example: (pseudo code)

def newestDate(date1, date2):
    result = date1;
    date1array = date1.split("/");
    date2array = date2.split("/");
    for i in range(0,date1array.length):
     if (int(date2array[i]) > int(date1array[i])):
          result = date2;
          break;
    return result;

Of course, you should build in some error checking in this function to ensure that nothing crazy happens. Also, convert each string being compared into an integer

Upvotes: 0

John Zwinck

Reputation: 249444

Make a dict, with the key being your "hash" column and the values being the entire lines. For each line, see if it is in the dict; if not, insert it. If it is there, overwrite the existing value if the date is later.

If you need the output to be ordered similarly to the input, use an OrderedDict instead of a regular one.

Upvotes: 4

Python: Dedup a list based on certain critera

Answers (4)

Related Questions