Reputation: 223
SO, I have been trying to edit this script in order to define how many days ago the code starts working for each row.
import csv
import datetime
import copy
from collections import defaultdict
with open(r"C:\Temp\test.csv") as i, open(r"C:\Temp\resuls.csv", "wb") as o:
rdr = csv.reader(i)
wrt = csv.writer(o)
data, currdate = defaultdict(lambda:[0, 0, 0, 0]), None
for line in rdr:
date, name = datetime.datetime.strptime(line[0], '%d/%m/%Y'), line[7]
if date != currdate or not currdate:
for v in data.itervalues(): v[:2] = v[2:]
currdate = date
wrt.writerow(line + data[name][:2])
data[name][3] += 1
if line[6] == "1": data[name][2] += 1
02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith
02/01/2005,Data,Class tn2,4,10yo+,6,4,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips
02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith,0,0
02/01/2005,Data,Class tn2,4,10yo+,6,4,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith,1,2
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James,0,1
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips,0,3
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly,1,2
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James,0,2
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith,2,7
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly,1,2
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips,0,5
It is doing the following and appending as two new columns:
Currently my source CSV contains years of historic data, what I need to be able to do is to limit the amount of time it remembers the data for. I am not sure if the best way to achieve this is by editing the original code or by re writing the whole thing. For instance if I wanted to perform the same counts for the last 365 days only or the last 60 days only. Any suggestions of a method or edit to achieve this?
Upvotes: 2
Views: 906
Reputation: 975
If I understand your requirement correctly, you want to calculate 1_occurrence (in Column 7) & name_occurrence (Column 8) in the last N days and ignore any data prior to Nth Day
As you want to limit the search by N of days, you need to maintain date wise data. If you do that, your script would still work with some tweak
You can do something like this
def f5(last_N_days):
with open("data.csv") as i, open("resuls.csv", "wb") as o:
rdr = csv.reader(i)
wrt = csv.writer(o)
data, currdate = defaultdict(lambda:[0, 0,0,0]), None
cumulativedata = defaultdict(lambda:defaultdict(lambda:[0, 0,0,0]))
for line in rdr:
date, name = datetime.datetime.strptime(line[0], '%d/%m/%Y'), line[7]
if date != currdate or not currdate:
for k,v in data.iteritems():
cumulativedata[currdate][k] = list(v[2:])
v[:2]=[v[2] - cumulativedata[clean_before][k][0],
v[3] - cumulativedata[clean_before][k][1]]
clean_before = date + datetime.timedelta(days=-(last_N_days))
currdate = date
wrt.writerow(line + data[name][:2])
data[name][3] += 1
if line[6] == "1":
data[name][2] += 1
f5(1) # Calculate only for last 1 day
which produces the following
02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith,0,0
02/01/2005,Data,Class tn2,4,10yo+,6,4,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith,1,2
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James,0,0
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips,0,2
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly,0,0
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James,0,1
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith,1,3
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly,0,0
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips,0,2
Note : I assumed every date has an entry. If this is not true (if any gap), you may want to tweak the script accordingly
Just a thought:
If the reading huge historical data is hurting you, then you can limit the csv data read to 2*N_days (N1...Nn,Nn+1...N2n where Nn+1...N2n is your working block)
a solution is proposed here
then workout your search logic within this buffer
Upvotes: 4
Reputation: 17829
If you really don't want to use a database, and load the whole csv, I would use pandas for this kind of querying (note,(possibly much) slower than SQL querying)
Here is how you would load the csv and get only the last 60 days:
import pandas as pd
import datetime
today = datetime.date.today()
sixty_days_ago = today - datetime.timedelta(days=60)
df = pd.io.parsers.read_csv("your_csv.csv")
#assume column name is date
condition = (df.date >= sixty_days_ago) & (df.date <= now)
df_between_correct_dates = df[condition]
now df_between_correct_dates
will be a pandas DataFrame
that includes only rows from your csv that had a date in the date column that is no older than 60 days ago!
Let me know if this does not solve your problem.
Upvotes: 0