Reputation: 2735
I want a DataFrame where the top rows of one column (called 'cat') have value "LOW", the mid and bottom parts of the frame will have values "MID" and "HI". So, for a frame of 1,200 rows, the value counts for the cat
columns should result in:
LOW 400
MID 400
HI 400
This should be easy. But, apparently it is not really. To no avail I tried to select and change the bottom rows using df.loc[-400:,["cat"]] = "HI"
But, this approach does work for the top-rows: df.loc[:399,["cat"]] = "LOW"
The sample below shows a working example, and note that it requires both loc
and iloc
. Is this where pandas can improve?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([1200, 4]), columns=['A', 'B', 'C', 'D'])
df["cat"] = "MID"
df.loc[:399,["cat"]] = "LOW"
df.iloc[-400:,-1] = "HI" # The -1 selects the last column ('cat') - not ideal.
df.cat.value_counts()
Upvotes: 2
Views: 994
Reputation: 5437
Answering the question if pandas can improve here: I the documentation it's clearly stated what loc is doing:
.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found.
so -400 is simply not a label in your index. Thus the behavior is as intended.
What one often wants is and accessor for iloc based row access and loc based column access. But for this, the .get_loc
-function comes into play.
You could also use the deprecated .ix
-indexer. However, its behavior caused some confusion. She examples and methods using the .loc
and .iloc
accessors here.
Essentially, @Jezrael's solution are also found in the link above.
To summarize: Pandas had a solution to your problem in place, but it confused users. So in order to provide a more consistent API it was decided to remove that feature in the future
Upvotes: 0
Reputation: 862691
Use get_loc
for position of column cat
if want select by positions by iloc
- need positions of index and columns:
df = pd.DataFrame(np.random.random([1200, 4]), columns=['A', 'B', 'C', 'D'])
df["cat"] = "MID"
df.iloc[:400,df.columns.get_loc('cat')] = "LOW"
df.iloc[-400:,df.columns.get_loc('cat')] = "HI"
Detail:
print (df.columns.get_loc('cat'))
4
Alternative is use loc
for select by labels - then need select 400
values of index by indexing:
df.loc[df.index[:400],"cat"] = "LOW"
df.loc[df.index[-400:],"cat"] = "HI"
a = df.cat.value_counts()
print (a)
MID 400
HI 400
LOW 400
Name: cat, dtype: int64
Another ways for set 400 values use numpy.repeat
or set values by repeat of lists:
df["cat"] = np.array(["LOW", "MID", "HI"]).repeat(400)
df["cat"] = ["LOW"] * 400 + ["MID"] * 400 + ["HI"] * 400
#thanks @Quickbeam2k1
df = df.assign(cat = ['LOW']*400 + ['MID']*400 + ['HIGH']*400 )
Upvotes: 2