John

Reputation: 71

Python retrieve row index of a Dataframe

How can I retrieve the index of a row in a DataFrame? I am able to retrieve the index of rows selected with df.loc:

idx = data.loc[data.name == "Smith"].index

I can even retrieve row index from df.loc by using data.index like this:

idx = data.loc[data.index == 5].index

However, I cannot retrieve the index directly from the row itself (i.e., from row.index, instead of df.loc[].index). I tried this code:

idx = data.iloc[5].index

The result of this code is the column names.

To provide context, the reason I need the index of a specific row (instead of rows from df.loc) is that I want to use df.apply on each row. I plan to use df.apply to run a function on each row that copies data from the row immediately above it.

def retrieve_gender (row):
    # This is panel data in which only the year-2000 rows are already keyed in. Time-invariant data in later years are the same as those in 2000.
    if row["Year"] == 2000:
        pass
    elif row["Year"] == 2001: # To avoid complexity, let's use only year 2001 as example.
        idx = row.index # This is wrong code.
        row["Gender"] = row.iloc[idx-1]["Gender"]
    return row["Gender"]


data["Gender"] = data.apply(retrieve_gender, axis=1)

Upvotes: 1

Views: 12405

Answers (2)

jpp

Reputation: 164613

apply gives series indexed by column labels

The problem with idx = data.iloc[5].index is that data.iloc[5] returns the row as a pd.Series object indexed by column labels, so its .index attribute holds the column names rather than the row's label.
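To make this concrete, here is a minimal sketch with made-up sample data (the values and the index labels 10 to 12 are assumptions for illustration, not from the question). It shows that the row Series returned by iloc exposes the column labels through .index, while the row's own label is kept in .name:

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the question.
data = pd.DataFrame(
    {"name": ["Smith", "Smith", "Jones"], "Year": [2000, 2001, 2000]},
    index=[10, 11, 12],
)

row = data.iloc[1]               # second row, returned as a Series
column_labels = list(row.index)  # the Series index is the column labels
row_label = row.name             # the row's own index label

print(column_labels)  # ['name', 'Year']
print(row_label)      # 11
```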

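The same applies inside pd.DataFrame.apply with axis=1: the Series passed to your retrieve_gender function is indexed by column labels, so row.index is not the row's position. The row's label is available as row.name, but the function still receives only that single row, so it cannot read the row above it without reaching back into the DataFrame.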

Use vectorised logic instead

With Pandas, row-wise logic is inefficient and not recommended because it involves a Python-level loop. Use column-wise logic instead. Taking a step back, it seems you wish to implement two rules:

  1. If Year is not 2001, leave Gender unchanged.
  2. If Year is 2001, use Gender from previous row.

np.where + shift

For the above logic, you can use np.where (with NumPy imported as np) together with pd.Series.shift:

data['Gender'] = np.where(data['Year'] == 2001, data['Gender'].shift(), data['Gender'])

mask + shift

Alternatively, you can use mask + shift:

data['Gender'] = data['Gender'].mask(data['Year'] == 2001, data['Gender'].shift())
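As a rough illustration of both one-liners, the sketch below runs them on a small made-up frame (the sample values are assumptions, with None standing in for the Gender entries that have not been keyed in yet). It assumes each 2001 row sits directly below the matching 2000 row:

```python
import numpy as np
import pandas as pd

# Hypothetical panel: Gender is filled in for 2000 and missing for 2001.
data = pd.DataFrame({
    "Year": [2000, 2001, 2000, 2001],
    "Gender": ["F", None, "M", None],
})

# shift() moves every value down one row, so each row sees the previous row's Gender.
previous_gender = data["Gender"].shift()
is_2001 = data["Year"] == 2001

# np.where picks previous_gender where the condition holds, else the original value.
via_where = np.where(is_2001, previous_gender, data["Gender"])

# mask replaces values where the condition holds and keeps the rest.
via_mask = data["Gender"].mask(is_2001, previous_gender)

print(list(via_where))    # ['F', 'F', 'M', 'M']
print(via_mask.tolist())  # ['F', 'F', 'M', 'M']
```

Note that np.where returns a NumPy array while mask returns a Series; either can be assigned back to data['Gender'].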

Upvotes: 0

SimbaPK

Reputation: 596

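With Pandas you can also loop through your DataFrame explicitly. This assumes the default integer index (0, 1, 2, ...) and an integer Year column, as in the question:

# Start at 1 because the first row has no row above it.
for index in range(1, len(df)):
    if df.loc[index, 'Year'] == 2001:
        df.loc[index, 'Gender'] = df.loc[index - 1, 'Gender']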
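The loop above relies on the index labels being consecutive integers starting at 0. If that is not the case, a position-based variant using iloc may be safer. The following is only a sketch on made-up data (the values and the index labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame whose index labels are NOT 0, 1, 2, ...
df = pd.DataFrame(
    {"Year": [2000, 2001, 2000, 2001], "Gender": ["F", None, "M", None]},
    index=[10, 11, 20, 21],
)

# iloc works with integer positions, so look up the column positions once.
year_col = df.columns.get_loc("Year")
gender_col = df.columns.get_loc("Gender")

# Start at position 1 because the first row has no row above it.
for pos in range(1, len(df)):
    if df.iloc[pos, year_col] == 2001:
        df.iloc[pos, gender_col] = df.iloc[pos - 1, gender_col]

print(df["Gender"].tolist())  # ['F', 'F', 'M', 'M']
```

This is still a Python-level loop, so it will be slow on large frames compared with a vectorised approach.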

Upvotes: 1
