Vladimir Fokow

Reputation: 3883

Dataframe columns having individual indices

I have noticed an interesting behaviour which I haven't seen in the documentation:

Each column inside a dataframe can have its individual index!

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(4, 3, order='F'),
                  columns=list('abc'))

df
   a  b   c
0  0  4   8
1  1  5   9
2  2  6  10
3  3  7  11

Assign a new index to column b:

df['b'].index = [-1, 2, 4, 5]

The columns now have different indices, although the dataframe itself still has a single index:

df['a']
0    0
1    1
2    2
3    3
Name: a, dtype: int64

df['b']
-1    4
 2    5
 4    6
 5    7
Name: b, dtype: int64


df.loc[:2, ['b']]
   b
0  4
1  5
2  6

df.loc[:2, 'b']
-1    4
 2    5
Name: b, dtype: int64

Is this described somewhere in the documentation?

Why can this be done in the first place? And can this be useful for something?

Upvotes: 14

Views: 368

Answers (1)

Cairoknox

Reputation: 198

Is this described somewhere in the documentation?

After an extensive look at the documentation, it seems not.

Why can this be done in the first place?

Because of the existence of a cache, it seems. If I take your example:

df = pd.DataFrame(np.arange(12).reshape(4, 3, order='F'),
                  columns=list('abc'))
>>> df
   a  b   c
0  0  4   8
1  1  5   9
2  2  6  10
3  3  7  11

>>> df._item_cache
{}

At first, the cache is empty. As soon as you access a column, the resulting series is stored in the cache.

>>> df['a']
0    0
1    1
2    2
3    3
Name: a, dtype: int32

>>> df._item_cache
{'a': 0    0
 1    1
 2    2
 3    3
 Name: a, dtype: int32}

Now let's change the index of series 'b'. The cache then looks like this:

>>> df['b'].index = [-1, 2, 4, 5]

>>> df._item_cache
{'a': 0    0
 1    1
 2    2
 3    3
 Name: a, dtype: int32,
 'b': -1    4
  2    5
  4    6
  5    7
 Name: b, dtype: int32}

When you access column 'b' through DataFrame's __getitem__(), this is roughly what seems to happen:

key = 'b'
df._get_item_cache(key)

Specifically, in the source code, since 'b' is hashable, this returns the cached series object itself. If the key were not hashable (such as the list ['b']), a copy would be made first, which makes the cache irrelevant.
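To make the mechanism concrete, here is a toy model in plain Python. It is only a sketch of the caching idea described above, not pandas' actual implementation: ToySeries and ToyFrame are hypothetical classes invented for illustration, and only the attribute name _item_cache is borrowed from pandas.

```python
class ToySeries:
    """Hypothetical stand-in for a Series: values plus a mutable index."""

    def __init__(self, values, index):
        self.values = list(values)
        self.index = list(index)


class ToyFrame:
    """Hypothetical stand-in for a DataFrame with a per-column item cache."""

    def __init__(self, data):
        self._data = {name: list(col) for name, col in data.items()}
        n_rows = len(next(iter(self._data.values())))
        self.index = list(range(n_rows))
        self._item_cache = {}

    def __getitem__(self, key):
        if isinstance(key, list):
            # List key: build a fresh frame from the stored values,
            # so the cache (and any index set on a cached column) is bypassed.
            return ToyFrame({name: self._data[name] for name in key})
        # Hashable key: build the column object once, then always reuse it.
        if key not in self._item_cache:
            self._item_cache[key] = ToySeries(self._data[key], self.index)
        return self._item_cache[key]


toy = ToyFrame({"a": [0, 1, 2, 3], "b": [4, 5, 6, 7]})
toy["b"].index = [-1, 2, 4, 5]  # mutates the cached object, not the frame

print(toy["b"].index)    # [-1, 2, 4, 5]  (same cached object comes back)
print(toy["a"].index)    # [0, 1, 2, 3]
print(toy[["b"]].index)  # [0, 1, 2, 3]   (fresh object, frame's own index)
```

The point of the sketch is that the new index lives only on the cached object: any access path that rebuilds the column from the frame's own data never sees it.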

.loc's __getitem__() works a bit differently, but in essence df.loc[:2, 'b'] extracts the cached series and slices it using its modified index. I am not exactly sure why the list key ['b'] shows the dataframe's index in that case, though. That side needs more digging.

And can this be useful for something?

Probably not. I guess it could serve as a third dimension for your data, but I believe there are other, more practical solutions.
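As one example of a more conventional route, if all you want is a relabelled view of a single column, you can ask for a new series instead of assigning to the index of the object that df['b'] returns. This is a minimal sketch using Series.set_axis; the name b_relabelled is just an illustrative choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3, order="F"), columns=list("abc"))

# set_axis returns a new Series with the given labels;
# nothing is assigned to an object obtained from df.
b_relabelled = df["b"].set_axis([-1, 2, 4, 5])

print(b_relabelled.index.tolist())  # [-1, 2, 4, 5]
print(df["b"].index.tolist())       # [0, 1, 2, 3]
```

This keeps the relabelled data in its own variable, so it does not rely on any internal caching behaviour of the dataframe.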

Upvotes: 3
