Counting number of occurrences up to x days previous to row date

Question

SO, I have been trying to edit this script in order to define how many days ago the code starts working for each row.

The code I am using currently:

import csv
import datetime
import copy
from collections import defaultdict

with open(r"C:\Temp	est.csv") as i, open(r"C:\Temp
esuls.csv", "wb") as o:
    rdr = csv.reader(i)
    wrt = csv.writer(o)

    data, currdate = defaultdict(lambda:[0, 0, 0, 0]), None
    for line in rdr:
        date, name = datetime.datetime.strptime(line[0], '%d/%m/%Y'), line[7]

        if date != currdate or not currdate:
            for v in data.itervalues(): v[:2] = v[2:]
            currdate = date

        wrt.writerow(line + data[name][:2])

        data[name][3] += 1
        if line[6] == "1": data[name][2] += 1

The data this works on:

02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith
02/01/2005,Data,Class tn2,4,10yo+,6,4,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips

And produces:

02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith,0,0
02/01/2005,Data,Class tn2,4,10yo+,6,4,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith,1,2
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James,0,1
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips,0,3
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly,1,2
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James,0,2
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith,2,7
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly,1,2
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips,0,5

It is doing the following and appending as two new columns:

Counting the number of times a person appears in the list on dates prior to the date specified in the row and a 1 occurs in column 7.
The number of times a person (column 8) appears in the list on dates prior to the date specified in the row (note the source data are sorted chronologically.)

Currently my source CSV contains years of historic data, what I need to be able to do is to limit the amount of time it remembers the data for. I am not sure if the best way to achieve this is by editing the original code or by re writing the whole thing. For instance if I wanted to perform the same counts for the last 365 days only or the last 60 days only. Any suggestions of a method or edit to achieve this?

user2390183 · Accepted Answer

If I understand your requirement correctly, you want to calculate 1_occurrence (in Column 7) & name_occurrence (Column 8) in the last N days and ignore any data prior to Nth Day

As you want to limit the search by N of days, you need to maintain date wise data. If you do that, your script would still work with some tweak

You can do something like this

def f5(last_N_days):
    with open("data.csv") as i, open("resuls.csv", "wb") as o:
        rdr = csv.reader(i)
        wrt = csv.writer(o)
        data, currdate = defaultdict(lambda:[0, 0,0,0]), None
        cumulativedata = defaultdict(lambda:defaultdict(lambda:[0, 0,0,0]))
        for line in rdr:
            date, name = datetime.datetime.strptime(line[0], '%d/%m/%Y'), line[7]

            if date != currdate or not currdate:
                for k,v in data.iteritems():
                    cumulativedata[currdate][k] = list(v[2:])
                    v[:2]=[v[2] - cumulativedata[clean_before][k][0],
                                 v[3] - cumulativedata[clean_before][k][1]]
                clean_before = date + datetime.timedelta(days=-(last_N_days))
                currdate = date

            wrt.writerow(line + data[name][:2])

            data[name][3] += 1

            if line[6] == "1":
                data[name][2] += 1

f5(1) # Calculate only for last 1 day

which produces the following

02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith,0,0
02/01/2005,Data,Class tn2,4,10yo+,6,4,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith,1,2
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James,0,0
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips,0,2
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly,0,0
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James,0,1
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith,1,3
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly,0,0
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips,0,2

Note : I assumed every date has an entry. If this is not true (if any gap), you may want to tweak the script accordingly

Just a thought:

If the reading huge historical data is hurting you, then you can limit the csv data read to 2*N_days (N1...Nn,Nn+1...N2n where Nn+1...N2n is your working block)

a solution is proposed here

then workout your search logic within this buffer

Counting number of occurrences up to x days previous to row date

The code I am using currently:

The data this works on:

And produces:

Answers (2)

Related Questions