AEA
AEA

Reputation: 223

Counting number of occurrences up to x days previous to row date

SO, I have been trying to edit this script in order to define how many days ago the code starts working for each row.

The code I am using currently:

import csv
import datetime
import copy
from collections import defaultdict

with open(r"C:\Temp\test.csv") as i, open(r"C:\Temp\resuls.csv", "wb") as o:
    rdr = csv.reader(i)
    wrt = csv.writer(o)

    data, currdate = defaultdict(lambda:[0, 0, 0, 0]), None
    for line in rdr:
        date, name = datetime.datetime.strptime(line[0], '%d/%m/%Y'), line[7]

        if date != currdate or not currdate:
            for v in data.itervalues(): v[:2] = v[2:]
            currdate = date

        wrt.writerow(line + data[name][:2])

        data[name][3] += 1
        if line[6] == "1": data[name][2] += 1

The data this works on:

02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith
02/01/2005,Data,Class tn2,4,10yo+,6,4,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips

And produces:

02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith,0,0
02/01/2005,Data,Class tn2,4,10yo+,6,4,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith,1,2
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James,0,1
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips,0,3
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly,1,2
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James,0,2
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith,2,7
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly,1,2
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips,0,5

It is doing the following and appending as two new columns:

  1. Counting the number of times a person appears in the list on dates prior to the date specified in the row and a 1 occurs in column 7.
  2. The number of times a person (column 8) appears in the list on dates prior to the date specified in the row (note the source data are sorted chronologically.)

Currently my source CSV contains years of historic data, what I need to be able to do is to limit the amount of time it remembers the data for. I am not sure if the best way to achieve this is by editing the original code or by re writing the whole thing. For instance if I wanted to perform the same counts for the last 365 days only or the last 60 days only. Any suggestions of a method or edit to achieve this?

Upvotes: 2

Views: 906

Answers (2)

user2390183
user2390183

Reputation: 975

If I understand your requirement correctly, you want to calculate 1_occurrence (in Column 7) & name_occurrence (Column 8) in the last N days and ignore any data prior to Nth Day

As you want to limit the search by N of days, you need to maintain date wise data. If you do that, your script would still work with some tweak

You can do something like this

def f5(last_N_days):
    with open("data.csv") as i, open("resuls.csv", "wb") as o:
        rdr = csv.reader(i)
        wrt = csv.writer(o)
        data, currdate = defaultdict(lambda:[0, 0,0,0]), None
        cumulativedata = defaultdict(lambda:defaultdict(lambda:[0, 0,0,0]))
        for line in rdr:
            date, name = datetime.datetime.strptime(line[0], '%d/%m/%Y'), line[7]

            if date != currdate or not currdate:
                for k,v in data.iteritems():
                    cumulativedata[currdate][k] = list(v[2:])
                    v[:2]=[v[2] - cumulativedata[clean_before][k][0],
                                 v[3] - cumulativedata[clean_before][k][1]]
                clean_before = date + datetime.timedelta(days=-(last_N_days))
                currdate = date

            wrt.writerow(line + data[name][:2])

            data[name][3] += 1

            if line[6] == "1":
                data[name][2] += 1

f5(1) # Calculate only for last 1 day

which produces the following

02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith,0,0
02/01/2005,Data,Class tn2,4,10yo+,6,4,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith,1,2
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James,0,0
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips,0,2
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly,0,0
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James,0,1
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith,1,3
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly,0,0
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips,0,2

Note : I assumed every date has an entry. If this is not true (if any gap), you may want to tweak the script accordingly

Just a thought:

If the reading huge historical data is hurting you, then you can limit the csv data read to 2*N_days (N1...Nn,Nn+1...N2n where Nn+1...N2n is your working block)

a solution is proposed here

then workout your search logic within this buffer

Upvotes: 4

Ryan Saxe
Ryan Saxe

Reputation: 17829

If you really don't want to use a database, and load the whole csv, I would use pandas for this kind of querying (note,(possibly much) slower than SQL querying)

Here is how you would load the csv and get only the last 60 days:

import pandas as pd
import datetime

today = datetime.date.today()
sixty_days_ago = today - datetime.timedelta(days=60)

df = pd.io.parsers.read_csv("your_csv.csv")
#assume column name is date
condition = (df.date >= sixty_days_ago) & (df.date <= now)

df_between_correct_dates = df[condition]

now df_between_correct_dates will be a pandas DataFrame that includes only rows from your csv that had a date in the date column that is no older than 60 days ago!

Let me know if this does not solve your problem.

Upvotes: 0

Related Questions