Martien Lubberink
Martien Lubberink

Reputation: 2735

Why do loc and iloc work differently for slicing rows of a pandas DataFrame?

I want a DataFrame where the top rows of one column (called 'cat') have value "LOW", the mid and bottom parts of the frame will have values "MID" and "HI". So, for a frame of 1,200 rows, the value counts for the cat columns should result in:

LOW    400
MID    400
HI     400

This should be easy. But, apparently it is not really. To no avail I tried to select and change the bottom rows using df.loc[-400:,["cat"]] = "HI"

But, this approach does work for the top-rows: df.loc[:399,["cat"]] = "LOW"

The sample below shows a working example, and note that it requires both loc and iloc. Is this where pandas can improve?

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([1200, 4]), columns=['A', 'B', 'C', 'D'])
df["cat"] = "MID"
df.loc[:399,["cat"]] = "LOW"
df.iloc[-400:,-1] = "HI"  # The -1 selects the last column ('cat') - not ideal.
df.cat.value_counts()

Upvotes: 2

Views: 994

Answers (2)

Quickbeam2k1
Quickbeam2k1

Reputation: 5437

Answering the question if pandas can improve here: I the documentation it's clearly stated what loc is doing:

.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found.

so -400 is simply not a label in your index. Thus the behavior is as intended.

What one often wants is and accessor for iloc based row access and loc based column access. But for this, the .get_loc-function comes into play.

You could also use the deprecated .ix-indexer. However, its behavior caused some confusion. She examples and methods using the .loc and .iloc accessors here. Essentially, @Jezrael's solution are also found in the link above.

To summarize: Pandas had a solution to your problem in place, but it confused users. So in order to provide a more consistent API it was decided to remove that feature in the future

Upvotes: 0

jezrael
jezrael

Reputation: 862691

Use get_loc for position of column cat if want select by positions by iloc - need positions of index and columns:

df = pd.DataFrame(np.random.random([1200, 4]), columns=['A', 'B', 'C', 'D'])
df["cat"] = "MID"

df.iloc[:400,df.columns.get_loc('cat')] = "LOW"
df.iloc[-400:,df.columns.get_loc('cat')] = "HI"

Detail:

print (df.columns.get_loc('cat'))
4

Alternative is use loc for select by labels - then need select 400 values of index by indexing:

df.loc[df.index[:400],"cat"] = "LOW"
df.loc[df.index[-400:],"cat"] = "HI"

a = df.cat.value_counts()
print (a)
MID    400
HI     400
LOW    400
Name: cat, dtype: int64

Another ways for set 400 values use numpy.repeat or set values by repeat of lists:

df["cat"] =  np.array(["LOW", "MID", "HI"]).repeat(400)

df["cat"] =  ["LOW"] * 400 + ["MID"] * 400 +  ["HI"] * 400
#thanks  @Quickbeam2k1
df = df.assign(cat = ['LOW']*400 + ['MID']*400 + ['HIGH']*400 )

Upvotes: 2

Related Questions