Reputation: 837
I have a CSV file that contains minute data.
The end goal is to find the standard deviation of the lows (the 'Low' column) of each day, using all of that day's data.
The issue is that the CSV file has some holes in it: it does not have exactly 390 minutes per day (the number of minutes in a trading day). The code looks like this:
import pandas as pd

df = pd.read_csv('/Volumes/Seagate Portable/S&P 500 List/AAPL.txt')
df.columns = ['Extra', 'Dates', 'Open', 'High', 'Low', 'Close', 'Volume']
df.drop(['Extra', 'Open', 'High', 'Volume'], axis=1, inplace=True)

# parse the timestamps and keep only regular trading hours
df.Dates = pd.to_datetime(df.Dates)
df.set_index(df.Dates, inplace=True)
df = df.between_time('9:30', '16:00')

# try to grab the first row of each day by stepping 390 minutes at a time
print(df.Low[::390])
The output is as follows:
Dates
2020-01-02 09:30:00 73.8475
2020-01-02 16:00:00 75.0875
2020-01-03 15:59:00 74.3375
2020-01-06 15:58:00 74.9125
2020-01-07 15:57:00 74.5028
...
2020-12-14 09:41:00 122.8800
2020-12-15 09:40:00 125.9900
2020-12-16 09:39:00 126.5600
2020-12-17 09:38:00 129.1500
2020-12-18 09:37:00 127.9900
Name: Low, Length: 245, dtype: float64
As you can see in the output, if even one 9:30 bar is missing I can no longer index by steps of 390. My idea for a workaround is to use as much of the data as possible even when minutes are missing, by detecting where the day rolls over: that is, where the timestamps jump back down from, say, 15:59 or 16:00 to 9:30 (or 9:31, 9:32 if the open is missing). Is that a reasonable solution, or are there better ones? And if this is the way to go, what would be the best way to code it?
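Something along these lines is what I have in mind (a rough sketch; I'm not sure it's the right approach):
import numpy as np

# a new day starts wherever the clock jumps backwards (e.g. 16:00 -> 9:30),
# so count those jumps to build a day id instead of assuming 390 rows per day
minutes = df.index.hour * 60 + df.index.minute
day_id = (np.diff(minutes, prepend=minutes[0]) < 0).cumsum()
print(df.Low.groupby(day_id).std())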
Upvotes: 0
Views: 118
Reputation: 62383
Use .groupby() with pandas.Grouper() on the 'date' column, with freq='D' to group by day, and then aggregate .std() on 'low'.
The 'date' column must be a datetime dtype. Use pd.to_datetime() to convert the 'Dates' column, if needed.
Use df = df.set_index('date').between_time('9:30', '16:00').reset_index() to select only times within a specific range. This would be done before the .groupby(); the 'date' column needs to be the index to use .between_time().
import requests
import pandas as pd
# sample OHLC data (hourly crypto candles, standing in for stock minute data)
periods = '3600'  # candle length in seconds (1 hour)
resp = requests.get('https://api.cryptowat.ch/markets/poloniex/ethusdt/ohlc', params={'periods': periods})
data = resp.json()
df = pd.DataFrame(data['result'][periods], columns=['date', 'open', 'high', 'low', 'close', 'volume', 'amount'])
# convert to a datetime format
df['date'] = pd.to_datetime(df['date'], unit='s')
# display(df.head())
date open high low close volume amount
0 2020-11-22 02:00:00 550.544464 554.812114 536.523241 542.000000 2865.381737 1.567462e+06
1 2020-11-22 03:00:00 541.485933 551.621355 540.992000 548.500000 1061.275481 5.796859e+05
2 2020-11-22 04:00:00 548.722267 549.751680 545.153196 549.441709 310.874748 1.703272e+05
3 2020-11-22 05:00:00 549.157866 549.499632 544.135302 546.913493 259.077448 1.416777e+05
4 2020-11-22 06:00:00 547.600000 548.000000 541.668524 544.241871 363.433373 1.979504e+05
# group by day using pd.Grouper, then take the std of low
std = df.groupby(pd.Grouper(key='date', freq='D'))['low'].std().reset_index(name='low std')
# display(std)
date low std
0 2020-11-22 14.751495
1 2020-11-23 14.964803
2 2020-11-24 6.542568
3 2020-11-25 9.523858
4 2020-11-26 24.041421
5 2020-11-27 8.272477
6 2020-11-28 12.340238
7 2020-11-29 8.444779
8 2020-11-30 10.290333
9 2020-12-01 13.605846
10 2020-12-02 6.201248
11 2020-12-03 9.403853
12 2020-12-04 12.667251
13 2020-12-05 10.180626
14 2020-12-06 4.481538
15 2020-12-07 3.881311
16 2020-12-08 10.518746
17 2020-12-09 12.077622
18 2020-12-10 6.161330
19 2020-12-11 5.035066
20 2020-12-12 6.297173
21 2020-12-13 9.739574
22 2020-12-14 3.505540
23 2020-12-15 3.304968
24 2020-12-16 16.753780
25 2020-12-17 10.963064
26 2020-12-18 5.574997
27 2020-12-19 4.976494
28 2020-12-20 7.243917
29 2020-12-21 16.844777
30 2020-12-22 10.348576
31 2020-12-23 15.769288
32 2020-12-24 10.329158
33 2020-12-25 5.980148
34 2020-12-26 8.530006
35 2020-12-27 21.136509
36 2020-12-28 16.115898
37 2020-12-29 10.587339
38 2020-12-30 7.634897
39 2020-12-31 7.278866
40 2021-01-01 6.617027
41 2021-01-02 19.708119
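Applied to the minute data from the question, the time filter would come first; a sketch using the question's 'Dates' and 'Low' column names:
# keep only regular trading hours, then group by calendar day
df = df.set_index('Dates').between_time('9:30', '16:00').reset_index()
std = df.groupby(pd.Grouper(key='Dates', freq='D'))['Low'].std().reset_index(name='Low std')
print(std)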
Upvotes: 1