benito.cano

Reputation: 837

How to aggregate a function on a specific column for each day

I have a CSV file that has minute data in it.

The end goal is to find the standard deviation of the lows (the 'Low' column) for each day, using all of that day's data.

The issue is that the CSV file has some holes in it: not every day has exactly 390 minutes (the number of minutes in a trading day). The code looks like this:

import pandas as pd

# load the minute data and name the columns
df = pd.read_csv('/Volumes/Seagate Portable/S&P 500 List/AAPL.txt')
df.columns = ['Extra', 'Dates', 'Open', 'High', 'Low', 'Close', 'Volume']
df.drop(['Extra', 'Open', 'High', 'Volume'], axis=1, inplace=True)

# parse the timestamps and keep only regular trading hours
df.Dates = pd.to_datetime(df.Dates)
df.set_index(df.Dates, inplace=True)
df = df.between_time('9:30', '16:00')

# try to step through one trading day (390 minutes) at a time
print(df.Low[::390])

The output is as follows:

Dates
2020-01-02 09:30:00     73.8475
2020-01-02 16:00:00     75.0875
2020-01-03 15:59:00     74.3375
2020-01-06 15:58:00     74.9125
2020-01-07 15:57:00     74.5028
                         ...   
2020-12-14 09:41:00    122.8800
2020-12-15 09:40:00    125.9900
2020-12-16 09:39:00    126.5600
2020-12-17 09:38:00    129.1500
2020-12-18 09:37:00    127.9900
Name: Low, Length: 245, dtype: float64

As you can see in the output, if even one 9:30 minute is missing, I can no longer step through the data by taking every 390th row. My idea for a workaround is to detect the day boundaries directly: a new day starts whenever the time of day rolls back, say from 15:59 or 16:00 down to 9:31 or 9:32. That way I could use as much of the data as possible even when minutes are missing; a rough sketch of what I mean is below. Is there a better solution than this? And if this is the way to go, what would be the best way to code it?
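To make the idea concrete, here is a rough, untested sketch of the rollover detection I have in mind, using the `df` built above:

import pandas as pd

# rough sketch (assumes df is indexed by the minute timestamps as above):
# a new day starts wherever the time of day jumps back down,
# e.g. 16:00 -> 9:31 the next morning
minutes = pd.Series(df.index.hour * 60 + df.index.minute, index=df.index)
new_day = minutes.diff() < 0        # True at the first minute of each day
day_id = new_day.cumsum()           # 0, 0, ..., 1, 1, ..., 2, ...
daily_low_std = df['Low'].groupby(day_id).std()

But this feels roundabout, so I suspect there is a cleaner way.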

Upvotes: 0

Views: 118

Answers (1)

Trenton McKinney

Reputation: 62383

  • Use .groupby() with pandas.Grouper() on 'date', with freq='D' for day, and then aggregate .std() on 'low'.
    • The 'date' column must be a datetime dtype. Use pd.to_datetime() to convert the 'Dates' column, if needed.
  • If desired, use df = df.set_index('date').between_time('9:30', '16:00').reset_index() to select only times within a specific range. This would be done before the .groupby().
import requests
import pandas as pd

# sample stock data
periods = '3600'
resp = requests.get('https://api.cryptowat.ch/markets/poloniex/ethusdt/ohlc', params={'periods': periods})
data = resp.json()
df = pd.DataFrame(data['result'][periods], columns=['date', 'open', 'high', 'low', 'close', 'volume', 'amount'])

# convert to a datetime format
df['date'] = pd.to_datetime(df['date'], unit='s')

# display(df.head())
                 date        open        high         low       close       volume        amount
0 2020-11-22 02:00:00  550.544464  554.812114  536.523241  542.000000  2865.381737  1.567462e+06
1 2020-11-22 03:00:00  541.485933  551.621355  540.992000  548.500000  1061.275481  5.796859e+05
2 2020-11-22 04:00:00  548.722267  549.751680  545.153196  549.441709   310.874748  1.703272e+05
3 2020-11-22 05:00:00  549.157866  549.499632  544.135302  546.913493   259.077448  1.416777e+05
4 2020-11-22 06:00:00  547.600000  548.000000  541.668524  544.241871   363.433373  1.979504e+05

# groupby day, using pd.Grouper and then get std of low
std = df.groupby(pd.Grouper(key='date', freq='D'))['low'].std().reset_index(name='low std')

# display(std)
         date    low std
0  2020-11-22  14.751495
1  2020-11-23  14.964803
2  2020-11-24   6.542568
3  2020-11-25   9.523858
4  2020-11-26  24.041421
5  2020-11-27   8.272477
6  2020-11-28  12.340238
7  2020-11-29   8.444779
8  2020-11-30  10.290333
9  2020-12-01  13.605846
10 2020-12-02   6.201248
11 2020-12-03   9.403853
12 2020-12-04  12.667251
13 2020-12-05  10.180626
14 2020-12-06   4.481538
15 2020-12-07   3.881311
16 2020-12-08  10.518746
17 2020-12-09  12.077622
18 2020-12-10   6.161330
19 2020-12-11   5.035066
20 2020-12-12   6.297173
21 2020-12-13   9.739574
22 2020-12-14   3.505540
23 2020-12-15   3.304968
24 2020-12-16  16.753780
25 2020-12-17  10.963064
26 2020-12-18   5.574997
27 2020-12-19   4.976494
28 2020-12-20   7.243917
29 2020-12-21  16.844777
30 2020-12-22  10.348576
31 2020-12-23  15.769288
32 2020-12-24  10.329158
33 2020-12-25   5.980148
34 2020-12-26   8.530006
35 2020-12-27  21.136509
36 2020-12-28  16.115898
37 2020-12-29  10.587339
38 2020-12-30   7.634897
39 2020-12-31   7.278866
40 2021-01-01   6.617027
41 2021-01-02  19.708119
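Applied to the minute data from the question, the same pattern works. This is a sketch that assumes the file path and column layout shown there:

import pandas as pd

# sketch, assuming the file and column layout from the question
df = pd.read_csv('/Volumes/Seagate Portable/S&P 500 List/AAPL.txt')
df.columns = ['Extra', 'Dates', 'Open', 'High', 'Low', 'Close', 'Volume']
df['Dates'] = pd.to_datetime(df['Dates'])

# optionally restrict to regular trading hours before grouping
df = df.set_index('Dates').between_time('9:30', '16:00').reset_index()

# std of 'Low' per calendar day; .dropna() removes non-trading days,
# since freq='D' also emits weekends and holidays as empty groups
low_std = df.groupby(pd.Grouper(key='Dates', freq='D'))['Low'].std().dropna()

Because the grouping is by calendar date rather than by row position, missing minutes within a day do not matter: each day's std is computed from whatever minutes that day actually has.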

Upvotes: 1
