Issue trying to read and filter columns using pd.read_csv when columns have same names

Question

I have imported data from Yfinance using

historicaldata = pdr.get_data_yahoo(tickers, period='1mo', interval='5m', prepost=True, group_by='ticker') historicaldata.to_csv('./store/historicaldata.csv')

which creates the following csv

What I now want to do is pull in the "M6A=F" & "Volume" column and do a sum on the total volume but I am running into a problem trying to get at that column from the csv file using pandas.

If I try to load the columns back in using

voldata=pd.read_csv('./store/historicaldata.csv', usecols=['M6A=F'])

voldata.to_csv('volumetest.csv')

it only brings in the first "M6A=F" column it finds (the one with "Open" in the second row in the previous image) and no amount of filtering attempts helped me get to the M6A=F volume column.

So I did a test output using

voldata=pd.read_csv('./store/historicaldata.csv')

voldata.to_csv('volumetest.csv')

And discovered it creates the following csv where it has changed all the columns to unique data which makes me think it does that when loading it in using pd.read_csv.

(incidentally, I need to do this from the csv to avoid repeatedly pulling large amounts off yfinance every time I want to work on historical data)

What is the correct way for me to get at the "M6A=F" "Volume" column from a csv?

Timus · Accepted Answer

Since you're reading a csv-file that was produced from a dataframe with multi-index columns, you have to take that into account when reading it back into a dataframe.

Try something like the following:

from pandas_datareader import data as pdr
import yfinance as yf
yf.pdr_override()

tickers = "MYM=F M6A=F"
hist_data = pdr.get_data_yahoo(
    tickers, period="1mo", interval="5m", prepost=True, group_by="ticker"
)

# Writing to csv-file
hist_data.to_csv("hist_data.csv")

# Reading back from csv-file
hist_data = pd.read_csv("hist_data.csv", index_col=0, header=[0, 1])

# Selecting the M6A=F/Volume-column:
volume = hist_data[[("M6A=F", "Volume")]]
print(volume)

The first change is to set an index column by using index_col=0 (obviously the first here). And the second, header=[0, 1], is to make sure that the first 2 rows are used to build the multi-index columns. See here:

header : int, list of int, None, default ‘infer’

... The header can be a list of integers that specify row locations for a multi-index on the columns ...

Result:

                           M6A=F
                          Volume
Datetime                        
2022-06-06 09:40:00-04:00    0.0
2022-06-06 09:45:00-04:00   67.0
2022-06-06 09:50:00-04:00   36.0
2022-06-06 09:55:00-04:00   18.0
2022-06-06 10:00:00-04:00   61.0
...                          ...
2022-07-06 09:20:00-04:00   47.0
2022-07-06 09:25:00-04:00   12.0
2022-07-06 09:30:00-04:00    7.0
2022-07-06 09:31:10-04:00    0.0
2022-07-06 09:31:20-04:00    NaN

[6034 rows x 1 columns]

(I've used double brackets here hist_data[[("M6A=F", "Volume")]] to get a dataframe that shows the column label. If you don't need that, use single brackets hist_data[("M6A=F", "Volume")] etc.)

Issue trying to read and filter columns using pd.read_csv when columns have same names

Answers (1)

Related Questions