mpny1
mpny1

Reputation: 107

Unable to slice pandas dataframe (with date as key) using date as string

I'm generating an empty dataframe with a series of dates as the index. Data will be added to the dataframe at a later point.

cbd=pd.date_range(start=pd.datetime(2017,01,02),end=pd.datetime(2017,01,30),period=1)

df = pd.DataFrame(data=None,columns=['Test1','Test2'],index=cbd)

df.head()
           Test1 Test2
2017-01-02   NaN   NaN
2017-01-03   NaN   NaN
2017-01-04   NaN   NaN
2017-01-05   NaN   NaN
2017-01-06   NaN   NaN

A few slicing methods don't seem to work. The following returns a KeyError:

df['2017-01-02']

However any of the following work:

df['2017-01-02':'2017-01-02']
df.loc['2017-01-02']

What am I missing here? Why doesn't the first slice return a result?

Upvotes: 5

Views: 8429

Answers (3)

Mohammad Yusuf
Mohammad Yusuf

Reputation: 17064

Dual behavior of [] in df[]

  • When you don't use : inside [], then the value(s) inside it will be considered as column(s).
  • And when you use : inside [], then the value(s) inside it will be considered as row(s).

Why the dual nature?

Because most of the time people want to slice the rows instead of slicing the columns.

So they decided that x and y in df[x:y] should correspond to rows,

and x in d[x] or x, y in df[[x,y]] should correspond to column(s).

Example:

df = pd.DataFrame(data = [[1,2,3], [1,2,3], [1,2,3]],
                                 index = ['A','B','C'], columns = ['A','B','C'])
print df

Output:

   A  B  C
A  1  2  3
B  1  2  3
C  1  2  3

Now when you do df['B'], it can mean 2 things:

  • Take the 2nd index B and give you the 2nd row 1 2 3

                     OR
    
  • Take the 2nd column B and give you the 2nd column 2 2 2.

So in order to resolve this conflict and keep it unambiguous df['B'] will always mean that you want the column 'B', if there is no such column then it will throw an Error.

Why does df['2017-01-02'] fails?

It will search for a column '2017-01-02', Because there is no such column, it throws an error.

Why does df.loc['2017-01-02'] works then?

Because .loc[] has syntax of df.loc[row,column] and you can leave out the column if you will, as in your case, it simply means df.loc[row]

Upvotes: 14

seemo
seemo

Reputation: 257

First I have updated your test data (just for info) as it returns an 'invalid token' error. Please see changes here:

cbd=pd.date_range(start='2017-01-02',end='2017-01-30',period=1)
df = pd.DataFrame(data=None,columns=['Test1','Test2'],index=cbd)

Now looking at the first row:

In[1]:

df.head(1)

Out[1]:
          Test1 Test2
2017-01-02  NaN NaN

And then trying the initial slicing approach yields this error:

In[2]:    

df['2017-01-02']

Out[2]:

KeyError: '2017-01-02'

Now try this using the column name:

In[3]:    

df.columns

Out[3]:

Index(['Test1', 'Test2'], dtype='object')

In[4]:

We try 'Test1':

df['Test1']

And get the NaN output from this column.

Out[4]:

2017-01-02    NaN
2017-01-03    NaN
2017-01-04    NaN
2017-01-05    NaN

So the format you are using is designed to be used on the column name unless you use this format df['2017-01-02':'2017-01-02'].

The Pandas docs state "The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one)".

So as you correctly identified, DataFrame.loc is a label based indexer which yields the output you are looking for:

 In[5]:
df.loc['2017-01-02']

 Out[5]:

Test1    NaN
Test2    NaN
Name: 2017-01-02 00:00:00, dtype: object

Upvotes: 1

jezrael
jezrael

Reputation: 862691

There is difference, because use different approaches:

For select one row is necessary loc:

df['2017-01-02']

Docs - partial string indexing:

Warning

The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one):

dft['2013-1-15 12:30:00']

To select a single row, use .loc

In [74]: dft.loc['2013-1-15 12:30:00']
Out[74]: 
A    0.193284
Name: 2013-01-15 12:30:00, dtype: float64

df['2017-01-02':'2017-01-02']

This is pure partial string indexing:

This type of slicing will work on a DataFrame with a DateTimeIndex as well. Since the partial string selection is a form of label slicing, the endpoints will be included. This would include matching times on an included date.

Upvotes: 4

Related Questions