sparc_spread

Reputation: 10843

Pandas indexer methods and tuples as parameters

Let's say I have a pandas Series, and I want to access a set of elements at specific indices, like so:

In [1]:
from pandas import Series
import numpy as np

s = Series(np.arange(0,10))

In [2]: s.loc[[3,7]]

Out[2]:
3    3
7    7
dtype: int64

The .loc indexer accepts a list as the parameter for this type of selection. The .iloc and .ix indexers work the same way.

However, if I use a tuple for the parameter, both .loc and .iloc fail:

In [5]: s.loc[(3,7)]
---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
........
IndexingError: Too many indexers

In [6]: s.iloc[(3,7)]
---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
........

IndexingError: Too many indexers

And .ix produces a strange result:

In [7]: s.ix[(3,7)]
Out[7]: 3

Now, I get that you can't even do this with a plain Python list:

In [27]:
x = list(range(0,10))
x[(3,7)]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-cefdde088328> in <module>()
      1 x = list(range(0,10))
----> 2 x[(3,7)]

TypeError: list indices must be integers or slices, not tuple

To retrieve a set of specific indices from a list, you need to use a comprehension, as explained here.

But on the other hand, using a tuple to select rows from a pandas DataFrame seems to work fine for all three indexing methods. Here's an example with the .loc method:

In [8]:
from pandas import DataFrame
df = DataFrame({"x" : np.arange(0,10)})

In [9]:
df.loc[(3,7),"x"]

Out[9]:
3    3
7    7
Name: x, dtype: int64

My three questions are:

Upvotes: 2

Views: 2260

Answers (2)

JohnE

Reputation: 30424

It's hard to answer this in a systematic way, so I'll just answer list-style:

  1. I think the bigger question may be what exactly are you trying to do but are not able to? I.e. why do you want to use () instead of [] when [] is the standard way?
  2. Your first question is really about why the syntax is the way it is, and that's an almost impossible question to answer without going deep into the history of not just pandas but also numpy. In any event, @JoeCondron's simple answer is correct: tuples are for multi-indexes and lists are for advanced indexing (aka "fancy indexing"). I believe that fancy indexing was lifted directly from numpy and multi-indexes were added by pandas.
  3. For your last question, I guess this is an inconsistency, but Series and DataFrames are not the exact same thing, so it's not really possible for their behavior to be 100% consistent. In particular, DataFrame indexing requires extra machinery to distinguish between rows and columns whereas a Series only has to worry about rows.
  4. To answer your question in a very general sense, I think what you show here is that when you use non-standard syntax it might nevertheless work as you want, but it might not. So I don't think it's fair to say that df.loc worked here and s.loc didn't. Neither was guaranteed to work here (according to the documentation), but df.loc happened to. Furthermore, it is quite possible df.loc would stop working like this in a future version.
  5. If you do find an example of loc/iloc/ix not working as shown in the documentation, that should be pointed out and reported as a bug. I don't believe any of the above fall into that category but I could certainly be wrong about that.
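
As a rough illustration of point 2, here is a small sketch of why a tuple and a list cannot mean the same thing inside square brackets. It is not taken from the pandas source; ShowKey is a made-up helper that just returns the key it is given. In Python, obj[3, 7] and obj[(3, 7)] pass the same tuple to __getitem__, so numpy (and, I assume, pandas after it) reads a tuple as "one entry per axis or level", while a list is read as "several positions or labels".

```python
import numpy as np


class ShowKey:
    """Hypothetical helper: returns whatever key the brackets receive."""

    def __getitem__(self, key):
        return key


probe = ShowKey()
comma_key = probe[3, 7]     # Python packs this into the tuple (3, 7)
tuple_key = probe[(3, 7)]   # exactly the same key as above
list_key = probe[[3, 7]]    # a list is passed through as a list

a = np.arange(10)
picked = a[[3, 7]]          # fancy indexing: one list, two elements

try:
    a[(3, 7)]               # same as a[3, 7]: two axes on a 1-D array
    tuple_error = None
except IndexError as exc:
    tuple_error = exc       # numpy rejects the extra axis
```

So the brackets themselves never see a difference between s.loc[3, 7] and s.loc[(3, 7)]; only a list signals "give me both of these".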

Upvotes: 2

JoeCondron

Reputation: 8906

I think the answer to the first question is that tuples are used to locate entries in a MultiIndex. I don't think there are good answers to the other two questions, except that you've exposed a bug and an inconsistency, respectively, in the code (this isn't that hard to do :)). So the Series complains because you don't have a MultiIndex or, more generally, because the length of the tuple is greater than the number of levels in your index. The DataFrame should probably react in the same way but doesn't. I think the safest way to proceed is to reserve tuples for a MultiIndex and to use lists/arrays/Series for indexing multiple rows. As a side note, you would use a list/array of tuples to select multiple rows in a MultiIndex.
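
A minimal sketch of that convention, assuming a reasonably recent pandas; the index labels and values below are invented for illustration:

```python
import pandas as pd

# Flat index: use a list to pick several rows.
s = pd.Series(range(10))
rows = s.loc[[3, 7]]

# MultiIndex: a tuple names ONE row, with one entry per level.
mi = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
    names=["letter", "number"],
)
ms = pd.Series([10, 20, 30, 40], index=mi)
one = ms.loc[("a", 2)]

# MultiIndex: a list of tuples picks several rows.
several = ms.loc[[("a", 2), ("b", 1)]]
```

Here rows holds the elements at labels 3 and 7, one is the single value stored at ("a", 2), and several is a two-row Series.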

Upvotes: 2
