Christopher
Christopher

Reputation: 691

Python Dictionary Multiple Keys, Search Function

I have the following sample CSV named results1111.csv:

Master #,Scrape,Date of Transaction
2C7E4B,6854585658,5/2/2007
2C7E4B,8283876134,5/8/2007
2C7E4B,4258586585,5/18/2007
C585ED,5554541212,5/18/2004
585868,5555551214,8/16/2012

I have the following code which opens the CSV and then puts the data into multiple dictionaries:

with open('c:\\results1111.csv', "r") as f:
    f.next()
    reader = csv.reader(f)
    result = {}
    for row in reader:
        key = row[0]
        result[key] = row[1:]
        values = row[1:]
        telnumber = row[1]
        transdate = row[2]
#print key
#print values
#print telnumber
#print transdate
#print result

        d = {}
        d.setdefault(key, []).append(values)
        print d

The output of the above code is:

{'2C7E4B': [['6854585658', '5/2/2007']]}
{'2C7E4B': [['8283876134', '5/8/2007']]}
{'2C7E4B': [['4258586585', '5/18/2007']]}
{'C585ED': [['5554541212', '5/18/2004']]}
{'585868': [['5555551214', '8/16/2012']]}

I would like to search the dictionaries for any instance where the same key has multiple phone numbers tied to it, such as the first three entries in the output above. When that happens, I would then like to remove the dictionary with the earliest date. I would then like to output all of the remaining dictionaries back into a CSV. The output should look like this:

2C7E4B,8283876134,5/8/2007
2C7E4B,4258586585,5/18/2007
C585ED,5554541212,5/18/2004
585868,5555551214,8/16/2012

Since there are thousands of keys (in the real input csv), I am not sure how to write a statement to do this. Any help is appreciated.

Upvotes: 0

Views: 255

Answers (2)

Reut Sharabani
Reut Sharabani

Reputation: 31339

Here is an idea of what you probably want. the basic idea is to aggregate the dates under the same key, and finally clearing the earliest dated entry.

#!/usr/bin/env python
import csv
import datetime

# this is used to parse the dates, you can change this if you change the format
DATE_FORMAT = "%m/%d/%Y"

# this is a dates comparator, to sort the dates when removing earliest date
def compare_dates(date1, date2):
    d1 = datetime.datetime.strptime(date1, DATE_FORMAT)
    d2 = datetime.datetime.strptime(date2, DATE_FORMAT)
    return int((d1 - d2).total_seconds())

with open('res.csv', "r") as f:
    f.next()
    reader = csv.reader(f)
    result = {}

    for row in reader:
        key = row[0]
        telnumber = row[1]
        transdate = row[2]
        # if it's a new key, we will need a new lst for the aggregation
        if not key in result:
            result[key] = []
        # thisis where we aggregate the all same-key entries
        result[key].append([telnumber, transdate,])

# this function takes in a key-value from the dictionary,
# and returns the earliest entry from the value (value being a list)
def clear_list(kv):
    k, v = kv
    if len(v) > 1:
        return {k: sorted(v, lambda x, y: compare_dates(x[1], y[1]))[1:]}
    return {k: v}

# simply clears all entries we've aggregated under each key.
print map(clear_list, result.items())

# ... now write back to csv

Upvotes: 1

tdelaney
tdelaney

Reputation: 77357

You'll need to sort all of the reocrds for a single master by date, which is more easily done with a list than a dict. Since month/day/year date doesn't sort correctly without some sort of conversion, I create a datetime object as the first item of the record. Now the list will sort by date (and if two records have the same date, by telephone number) so it's just a question of finding, sorting and deleting items from the list.

import csv
import collections
import datetime as dt

open('temp.csv', 'w').write("""Master #,Scrape,Date of Transaction
2C7E4B,6854585658,5/2/2007
2C7E4B,8283876134,5/8/2007
2C7E4B,4258586585,5/18/2007
C585ED,5554541212,5/18/2004
585868,5555551214,8/16/2012
""")

with open('temp.csv') as f:
    f.next()
    reader = csv.reader(f)
    # map master to list of transactions
    result = collections.defaultdict(list)
    for row in reader:
        key = row[0]
        # make date sortable
        sortable_date = dt.datetime.strptime(row[2], '%m/%d/%Y')
        result[key].append([sortable_date, row[1], row[2]])

for value in result.values():
    # discard old records
    if len(value) > 1:
        value.sort()
        del value[0]
        # or to delete all but the last one
        # del value[:-1]

keys = result.keys()
keys.sort()

for key in keys:
    transactions = result[key]
    for transaction in transactions:
        print key, transaction[1], transaction[2]

Upvotes: 1

Related Questions