Reputation: 402
I have imported data from Yfinance using
historicaldata = pdr.get_data_yahoo(tickers, period='1mo', interval='5m', prepost=True, group_by='ticker')
historicaldata.to_csv('./store/historicaldata.csv')
which creates the following csv
What I now want to do is pull in the "M6A=F" & "Volume" column and do a sum on the total volume but I am running into a problem trying to get at that column from the csv file using pandas.
If I try to load the columns back in using
voldata=pd.read_csv('./store/historicaldata.csv', usecols=['M6A=F'])
voldata.to_csv('volumetest.csv')
it only brings in the first "M6A=F" column it finds (the one with "Open" in the second row in the previous image) and no amount of filtering attempts helped me get to the M6A=F volume column.
So I did a test output using
voldata=pd.read_csv('./store/historicaldata.csv')
voldata.to_csv('volumetest.csv')
And discovered it creates the following csv where it has changed all the columns to unique data which makes me think it does that when loading it in using pd.read_csv.
(incidentally, I need to do this from the csv to avoid repeatedly pulling large amounts off yfinance every time I want to work on historical data)
What is the correct way for me to get at the "M6A=F" "Volume" column from a csv?
Upvotes: 1
Views: 124
Reputation: 11351
Since you're reading a csv-file that was produced from a dataframe with multi-index columns, you have to take that into account when reading it back into a dataframe.
Try something like the following:
from pandas_datareader import data as pdr
import yfinance as yf
yf.pdr_override()
tickers = "MYM=F M6A=F"
hist_data = pdr.get_data_yahoo(
tickers, period="1mo", interval="5m", prepost=True, group_by="ticker"
)
# Writing to csv-file
hist_data.to_csv("hist_data.csv")
# Reading back from csv-file
hist_data = pd.read_csv("hist_data.csv", index_col=0, header=[0, 1])
# Selecting the M6A=F/Volume-column:
volume = hist_data[[("M6A=F", "Volume")]]
print(volume)
The first change is to set an index column by using index_col=0
(obviously the first here). And the second, header=[0, 1]
, is to make sure that the first 2 rows are used to build the multi-index columns. See here:
header : int, list of int, None, default ‘infer’
... The header can be a list of integers that specify row locations for a multi-index on the columns ...
Result:
M6A=F
Volume
Datetime
2022-06-06 09:40:00-04:00 0.0
2022-06-06 09:45:00-04:00 67.0
2022-06-06 09:50:00-04:00 36.0
2022-06-06 09:55:00-04:00 18.0
2022-06-06 10:00:00-04:00 61.0
... ...
2022-07-06 09:20:00-04:00 47.0
2022-07-06 09:25:00-04:00 12.0
2022-07-06 09:30:00-04:00 7.0
2022-07-06 09:31:10-04:00 0.0
2022-07-06 09:31:20-04:00 NaN
[6034 rows x 1 columns]
(I've used double brackets here hist_data[[("M6A=F", "Volume")]]
to get a dataframe that shows the column label. If you don't need that, use single brackets hist_data[("M6A=F", "Volume")]
etc.)
Upvotes: 2