Granitosaurus

Reputation: 21446

Filtering out csv rows by column data

I'm not sure what to call this, but I have a csv with the data:

...|Address    | Date       |...  
...|Abraham st.| 01/01/2008 |...  
...|Abraham st.| 02/02/2007 |...  
...|Abraham st.| 03/03/2011 |...  

What I want to do is keep only the newest entry (in this case it would be row 4). I'm really having trouble bending my mind around this.

My initial idea is to read the data from the csv into a list of rows and then:

  1. convert the date strings to datetime objects
  2. then go through every row, get its address, and compare it with every other row to find the highest date and save that row (a rough sketch of this is below).
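
Roughly, I'm picturing something like this (the file name, column positions, and date format here are only for illustration):

import csv
from datetime import datetime

def parse_date(row):
    # step 1: convert the date string to a datetime object (format is a guess)
    return datetime.strptime(row[1].strip(), '%d/%m/%Y')

with open('data.csv') as f:                        # placeholder file name
    rows = list(csv.reader(f, delimiter='|'))[1:]  # drop the header row

# step 2: keep a row only if no other row with the same address has a later date
newest = [row for row in rows
          if all(parse_date(row) >= parse_date(other)
                 for other in rows if other[0] == row[0])]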

Is there a better way to approach this?

Upvotes: 1

Views: 1456

Answers (3)

mhawke

Reputation: 87154

Just use the max builtin with a key function that extracts the date field and converts it into a datetime object. I assume that your dates are dd/mm/yyyy.

import csv
from datetime import datetime

DATE_COLUMN = 1
with open('input.csv') as f:
    reader = csv.reader(f, delimiter='|')
    next(reader)    # skip over the CSV header row
    most_recent = max(reader, key=lambda x : datetime.strptime(x[DATE_COLUMN].strip(), '%d/%m/%Y'))

>>> print most_recent
['Abraham st.', ' 03/03/2011']

I think your intent is to group by the "Address" column and select the most recent date from the "Date" column, in which case you can use itertools.groupby() like this:

import csv
from itertools import groupby
from datetime import datetime

ADDRESS_COLUMN = 0
DATE_COLUMN = 1
most_recent = []

with open('input.csv') as f:
    reader = csv.reader(f, delimiter='|')
    next(reader)    # skip over the CSV header row
    for k, g in groupby(sorted(reader), lambda x : x[ADDRESS_COLUMN]):
        most_recent.append(max(g, key=lambda x : datetime.strptime(x[DATE_COLUMN].strip(), '%d/%m/%Y')))

>>> print most_recent
[['Abraham st.', ' 03/03/2011'], ['Moses rd.', ' 10/12/2013'], ['Smith St.', ' 01/01/1999']]

Assuming input.csv contains this:

Address |Date
Abraham st.| 01/01/2008
Abraham st.| 02/02/2007
Abraham st.| 03/03/2011
Moses rd.| 10/12/2013
Moses rd.| 11/11/2011
Smith St.| 01/01/1999

Upvotes: 1

Martijn Pieters

Reputation: 1125398

Keep track of the highest value seen so far instead; I'm assuming here you already have a csv.reader() object reading the CSV data:

from datetime import datetime

max_date = datetime.min
newest_row = None

for row in csv_reader:
    # assumption: your date is the 4th column in each row
    date = datetime.strptime(row[3], '%m/%d/%Y')
    if date > max_date:
        # row is newer, remember it
        max_date = date
        newest_row = row

When you've read the whole file, newest_row will hold the data row with the most recent date. However, the code never holds more than 2 rows in memory (the row currently being processed, and the newest row found so far).

Note that I started max_date as datetime.min, which is the minimum value you can store in a datetime object; as long as your input file does not contain any rows for January 1st in the year 1, you should be good.
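
For illustration, datetime.min is simply the earliest representable datetime, so any real date in the file compares as larger:

from datetime import datetime

print(datetime.min)                           # 0001-01-01 00:00:00
print(datetime.min < datetime(2008, 1, 1))    # True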

Upvotes: 2

Robert Ekendahl

Reputation: 277

Not sure you need to "compare with every other row" (but that might just be me misunderstanding your intent). I would simply save the currently newest row as I loop over the rows.

Something like this pseudo code:

saved_row = None
for row in table:
    if saved_row is None:
        saved_row = row
    elif row.date > saved_row.date:
        saved_row = row

There is probably a more elegant way to store the initial row into saved_row.
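
For example (just a sketch, assuming the rows come from a csv.reader with the date in the second column in dd/mm/yyyy format), you could seed saved_row by pulling the first data row out of the iterator:

import csv
from datetime import datetime

def row_date(row):
    # parse the date column; the column index and format are assumptions
    return datetime.strptime(row[1].strip(), '%d/%m/%Y')

with open('input.csv') as f:
    reader = csv.reader(f, delimiter='|')
    next(reader)              # skip the header row
    saved_row = next(reader)  # seed with the first data row
    for row in reader:
        if row_date(row) > row_date(saved_row):
            saved_row = row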

Upvotes: 0
