Reputation: 69
I'm just looking for some help in deciding a method that would be most efficient. I have one dataset with particular dates, there is no regular timestep. For each of these dates I want to create a row with values ranging from 10 days before to 3 days after the date. The data I need is in 2 columns, dates in one, values in the other.
What sprung to mind was to use a loop to compare the dates and extract the values I need. I am thinking there might be a better way, using numpy\pandas or maybe something else? I feel like my idea is a fairly convoluted way of going about things.
EDIT: So the data in would be like this.
Date Values
2014-02-09 38.351
2014-02-10 38.281
2014-02-11 38.146
2014-02-12 38.205
2014-02-13 38.428
2014-02-14 38.449
2014-02-15 38.540
2014-02-16 38.586
2014-02-17 38.489
2014-02-18 38.552
2014-02-19 38.580
2014-02-20 38.447
2014-02-21 38.336
2014-02-22 38.284
2014-02-23 38.183
2014-02-24 38.143
2014-02-25 38.146
2014-02-26 38.221
2014-02-27 38.182
2014-02-28 38.170
And a sample output for one row would be in the form:
t-10 t-9 t-8 t-7 t-6 t-5 t-4 t-3 \
Date
2014-02-19 37.728 37.753 37.652 37.549 37.474 37.407 37.344 37.278
t-2 t-1 t t+1 t+2 t+3
Date
2014-02-19 37.221 37.18 37.125 37.138 37.414 37.394
Where the values from t-10 to t+3 are extracted when t = 2014-02-19. I need to do this for several different dates.
Edit: I have these specific dates I need to use. The values t-10 to t+3 with t as each of the below dates for example. This is what lead me to consider using a loop. But it seems like a messy way of doing things.
Date
0 2014-11-22
1 2014-12-28
2 2015-01-02
3 2015-02-04
4 2015-02-16
5 2015-02-28
6 2015-03-12
7 2015-03-24
8 2015-04-05
9 2015-04-15
10 2015-04-17
11 2015-04-20
12 2015-11-07
13 2015-11-10
14 2015-11-19
15 2015-11-22
16 2015-11-29
17 2015-12-01
18 2015-12-04
19 2015-12-11
Upvotes: 1
Views: 281
Reputation: 157
Using exactly the table you provided, I first created the columns before calling them just for clarity.
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
for daysDelta in range(-10,4):
key = 't'+str(daysDelta)
df[key] = np.nan
Wich gives me this when I look at the first five rows:
Date Values t-10 t-9 t-8 t-7 t-6 t-5 t-4 t-3 t-2 t-1 t0 t1 t2 t3
0 2014-09-02 38.351 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2014-10-02 38.281 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2014-11-02 38.146 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2014-12-02 38.205 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2014-02-13 38.428 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Then you want to loop through every row and assign the correct value to each column using the index (which row to assing) and key (which column):
for index, row in df.iterrows():
for daysDelta in range(-10,4): #loops through days
key = 't'+str(daysDelta)
# will be true if the difference of days is the one you are looking for
booleanTimeDelta = ((df.loc[:,'Date'] -row['Date']).dt.days == daysDelta)
# if any are true find them and assign it
if any(booleanTimeDelta):
df.loc[index:index+1,key] = df.loc[booleanTimeDelta,'Values'].values
This is the output for your example
Date Values t-10 t-9 t-8 t-7 t-6 t-5 t-4 t-3 t-2 t-1 t0 t1 t2 t3
0 2014-09-02 38.351 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 38.351 NaN NaN NaN
1 2014-10-02 38.281 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 38.281 NaN NaN NaN
2 2014-11-02 38.146 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 38.146 NaN NaN NaN
3 2014-12-02 38.205 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 38.205 NaN NaN NaN
4 2014-02-13 38.428 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 38.428 38.449 38.540 38.586
5 2014-02-14 38.449 NaN NaN NaN NaN NaN NaN NaN NaN NaN 38.428 38.449 38.540 38.586 38.489
6 2014-02-15 38.540 NaN NaN NaN NaN NaN NaN NaN NaN 38.428 38.449 38.540 38.586 38.489 38.552
7 2014-02-16 38.586 NaN NaN NaN NaN NaN NaN NaN 38.428 38.449 38.540 38.586 38.489 38.552 38.580
8 2014-02-17 38.489 NaN NaN NaN NaN NaN NaN 38.428 38.449 38.540 38.586 38.489 38.552 38.580 38.447
9 2014-02-18 38.552 NaN NaN NaN NaN NaN 38.428 38.449 38.540 38.586 38.489 38.552 38.580 38.447 38.336
10 2014-02-19 38.580 NaN NaN NaN NaN 38.428 38.449 38.540 38.586 38.489 38.552 38.580 38.447 38.336 38.284
11 2014-02-20 38.447 NaN NaN NaN 38.428 38.449 38.540 38.586 38.489 38.552 38.580 38.447 38.336 38.284 38.183
12 2014-02-21 38.336 NaN NaN 38.428 38.449 38.540 38.586 38.489 38.552 38.580 38.447 38.336 38.284 38.183 38.143
13 2014-02-22 38.284 NaN 38.428 38.449 38.540 38.586 38.489 38.552 38.580 38.447 38.336 38.284 38.183 38.143 38.146
14 2014-02-23 38.183 38.428 38.449 38.540 38.586 38.489 38.552 38.580 38.447 38.336 38.284 38.183 38.143 38.146 38.221
15 2014-02-24 38.143 38.449 38.540 38.586 38.489 38.552 38.580 38.447 38.336 38.284 38.183 38.143 38.146 38.221 38.182
16 2014-02-25 38.146 38.540 38.586 38.489 38.552 38.580 38.447 38.336 38.284 38.183 38.143 38.146 38.221 38.182 38.182
17 2014-02-26 38.221 38.586 38.489 38.552 38.580 38.447 38.336 38.284 38.183 38.143 38.146 38.221 38.182 38.182 NaN
18 2014-02-27 38.182 38.489 38.552 38.580 38.447 38.336 38.284 38.183 38.143 38.146 38.221 38.182 38.182 NaN NaN
Upvotes: 1
Reputation: 157
I'm going to make many assumptions about what you want but I think I understand your problem. You have something like this.
In [1]: df
Out[1]:
dates numbers sumOfDates
0 2016-02-04 1 NaN
1 2016-02-13 2 NaN
2 2016-01-25 4 NaN
3 2016-01-16 1 NaN
4 2016-01-27 3 NaN
5 2016-01-13 4 NaN
6 2016-01-15 5 NaN
7 2016-01-29 1 NaN
8 2016-01-11 3 NaN
9 2016-01-17 4 NaN
All those NaN is where
Here numbers being the values you talked about. Then this should be simple by iterating through rows locating all the files within the desired range. Then just apply to the sum and assign it to the appropiate row in the sumOfDates
for index, row in df.iterrows():
df.loc[index,'sumOfDates'] = df[(df.dates >(row['dates']-pd.DateOffset(10)))].loc[(df.dates <
(row['dates']+pd.DateOffset(3)))].numbers.sum()
The output should be the following:
df
dates numbers sumOfDates
0 2016-02-04 1 5.0
1 2016-02-13 2 3.0
2 2016-01-25 4 12.0
3 2016-01-16 1 17.0
4 2016-01-27 3 8.0
5 2016-01-13 4 12.0
6 2016-01-15 5 17.0
7 2016-01-29 1 8.0
8 2016-01-11 3 7.0
9 2016-01-17 4 17.0
If I'm wrong please post some examples so we can see if this is what you're looking for. But I hope you can see what you can do with pandas.
Upvotes: 0