Reputation: 21446
I'm not sure how to call this but I have an csv with the data:
...|Address | Date |...
...|Abraham st.| 01/01/2008 |...
...|Abraham st.| 02/02/2007 |...
...|Abraham st.| 03/03/2011|...
so what I want to do is only keep the newest entry(in this case it would be row4), I'm really having trouble bending my mind around this.
My initial idea is to read the data from csv to list of rows and then:
is there a better way to approach this?
Upvotes: 1
Views: 1456
Reputation: 87154
Just use the max
builtin with a key
function that extracts and converts the date field into a datetime
object. I assume that your dates are mm/dd/yyyy.
import csv
from datetime import datetime
DATE_COLUMN = 1
with open('input.csv') as f:
reader = csv.reader(f, delimiter='|')
next(reader) # skip over the CSV header row
most_recent = max(reader, key=lambda x : datetime.strptime(x[DATE_COLUMN].strip(), '%d/%m/%Y'))
>>> print most_recent
['Abraham st.', ' 03/03/2011']
I think your intent is to group by the "Address" column and select the most recent date from the "Date" column, in which case you can use itertools.groupby()
like this:
import csv
from itertools import groupby
from datetime import datetime
ADDRESS_COLUMN = 0
DATE_COLUMN = 1
most_recent = []
with open('input.csv') as f:
reader = csv.reader(f, delimiter='|')
next(reader) # skip over the CSV header row
for k, g in groupby(sorted(reader), lambda x : x[ADDRESS_COLUMN]):
most_recent.append(max(g, key=lambda x : datetime.strptime(x[DATE_COLUMN].strip(), '%d/%m/%Y')))
>>> print most_recent
[['Abraham st.', ' 03/03/2011'], ['Moses rd.', ' 10/12/2013'], ['Smith St.', ' 01/01/1999']]
Assuming input.csv contains this:
Address |Date
Abraham st.| 01/01/2008
Abraham st.| 02/02/2007
Abraham st.| 03/03/2011
Moses rd.| 10/12/2013
Moses rd.| 11/11/2011
Smith St.| 01/01/1999
Upvotes: 1
Reputation: 1125398
Keep track of the highest value seen so far instead; I'm assuming here you already have a csv.reader()
object reading the CSV data:
from datetime import datetime
max_date = datetime.min
newest_row = None
for row in csv_reader:
# assumption: your date is the 4th column in each row
date = datetime.strptime(row[3], '%m/%d/%Y')
if date > max_date:
# row is newer, remember it
max_date = date
newest_row = row
When you've read the whole file, newest_row
will hold the data row with the most recent date. However, the cold never holds more than 2 rows in memory (the currently row being processed, and the newest row found so far).
Note that I started max_date
as datetime.min
, which is the minimum value you can store in a datetime
object; as long as your input file does not contain any rows for January 1st in the year 1, you should be good.
Upvotes: 2
Reputation: 277
Not sure you need to "compare with every other row" (but that might just be me misunderstanding your intent. I would simply save the currently newest row as I loop over the column.
Something like this pseudo code:
saved_row = Null
for row in table:
if not saved_row:
saved_row = row
else if row.date > saved_row.date:
saved_row = row
There is probably a more elegant way to store the initial row into saved_row
Upvotes: 0