Reputation: 71
I pulled some stock data from a financial API and created a DataFrame with it. Columns were 'date', 'data1', 'data2', 'data3'. Then, I converted that DataFrame into a CSV with 'date' column as index:
df.to_csv('data.csv', index_label='date')
In a second script, I read that CSV and attempted to slice the resulting DataFrame between two dates:
df = pd.read_csv('data.csv', parse_dates=['date'] ,index_col='date')
df = df['2020-03-28':'2020-04-28']
When I attempt to do this, I get the following TypeError:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [2020-03-28] of <class 'str'>
So clearly, the problem is that I'm trying to use a str to slice a datetime object. But here's the confusing part! If in the first step, I save the DataFrame to a csv and DO NOT set 'date' as index:
df.to_csv('data.csv')
In my second script, I no longer get the TypeError:
df = pd.read_csv('data.csv', parse_dates=['date'] ,index_col='date')
df = df['2020-03-28':'2020-04-28']
Now it works just fine. The only problem is I have the default Pandas index column to deal with.
Why do I get a TypeError when I set the 'date' column as index in my CSV...but I do NOT get a TypeError when I don't set any index in the CSV?
Upvotes: 0
Views: 425
Reputation: 30971
It seems that in your "first" instance of df, date column was an ordinary column (not the index) and this DataFrame had a default index - consecutive integers (its name is not important).
In this situation running df.to_csv('data.csv', index_label='date')
causes that the output file contains:
date,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7
i.e.:
Then if you read it running
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
, then:
Now when you run df['2020-03-28':'2020-04-28']
, you attempt to find rows
with index in the range given. But the index column is of Int64Index
type (check this in your installation), hence just the mentioned exception
was thrown.
Things look other way when you run df.to_csv('data.csv')
.
Now this file contains:
,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7
i.e.:
Now when you read it, the result is:
And now, when you run df['2020-03-28':'2020-04-28']
everything is OK.
The thing to learn for the future:
Running df.to_csv('data.csv', index_label='date')
does not set this
column as the index. It only saves the current index column
under the given name, without any check whether any other column has
just the same name.
The result is that 2 columns can have the same name.
Upvotes: 2