Jcb Rb
Jcb Rb

Reputation: 71

Why am I getting this TypeError when I try to slice my Pandas DataFrame?

I pulled some stock data from a financial API and created a DataFrame with it. Columns were 'date', 'data1', 'data2', 'data3'. Then, I converted that DataFrame into a CSV with 'date' column as index:

df.to_csv('data.csv', index_label='date')

In a second script, I read that CSV and attempted to slice the resulting DataFrame between two dates:

df = pd.read_csv('data.csv', parse_dates=['date'] ,index_col='date')
df = df['2020-03-28':'2020-04-28']

When I attempt to do this, I get the following TypeError:

TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [2020-03-28] of <class 'str'>

So clearly, the problem is that I'm trying to use a str to slice a datetime object. But here's the confusing part! If in the first step, I save the DataFrame to a csv and DO NOT set 'date' as index:

df.to_csv('data.csv')

In my second script, I no longer get the TypeError:

df = pd.read_csv('data.csv', parse_dates=['date'] ,index_col='date')
df = df['2020-03-28':'2020-04-28']

Now it works just fine. The only problem is I have the default Pandas index column to deal with.

Why do I get a TypeError when I set the 'date' column as index in my CSV...but I do NOT get a TypeError when I don't set any index in the CSV?

Upvotes: 0

Views: 425

Answers (1)

Valdi_Bo
Valdi_Bo

Reputation: 30971

It seems that in your "first" instance of df, date column was an ordinary column (not the index) and this DataFrame had a default index - consecutive integers (its name is not important).

In this situation running df.to_csv('data.csv', index_label='date') causes that the output file contains:

date,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7

i.e.:

  • the index column (integers) was given date name, passed by you in index_label parameter,
  • the next column, which in df was named date was also given date name.

Then if you read it running df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date'), then:

  • the first date column (integers) has been read as date and set as the index,
  • the second date column (dates) has been read as date.1 and it is an ordinary column.

Now when you run df['2020-03-28':'2020-04-28'], you attempt to find rows with index in the range given. But the index column is of Int64Index type (check this in your installation), hence just the mentioned exception was thrown.

Things look other way when you run df.to_csv('data.csv'). Now this file contains:

,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7

i.e.:

  • the first column (which in df was the index) has no name and int values,
  • the only column named date is the second column and contains dates.

Now when you read it, the result is:

  • date (converted do DatetimeIndex) is the index,
  • the original index column got name Unnamed: 0, no surprise, since in the source file it had no name.

And now, when you run df['2020-03-28':'2020-04-28'] everything is OK.

The thing to learn for the future: Running df.to_csv('data.csv', index_label='date') does not set this column as the index. It only saves the current index column under the given name, without any check whether any other column has just the same name.

The result is that 2 columns can have the same name.

Upvotes: 2

Related Questions