Jeremy Hadfield

Reputation: 189

Slice all rows of a DataFrame past a certain value in a column

I am trying to find a more pandorable way to get all rows of a DataFrame past a certain value in a certain column (the Quarter column in this case).

I want to slice a DataFrame of GDP statistics to get all rows from the first quarter of 2000 (2000q1) onward. Currently, I'm doing this by looking up the index label of the row whose GDP_df["Quarter"] value equals 2000q1 and slicing from there (see below). This seems too convoluted, and there must be a simpler, more idiomatic way to achieve it. Any ideas?

Current Method:

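import pandas as pd

def get_GDP_df():
    # Read only the quarter and GDP columns (E and G), skipping the header rows.
    # Note: this argument was called parse_cols in older pandas; current versions use usecols.
    GDP_df = pd.read_excel(
        "gdplev.xls",
        names=["Quarter", "GDP in 2009 dollars"],
        usecols="E,G", skiprows=7)
    # Index label of the first row whose Quarter is 2000q1
    year_2000 = GDP_df.index[GDP_df["Quarter"] == '2000q1'].tolist()[0]
    # Quarter-over-quarter growth, formatted as a percentage string
    GDP_df["Growth"] = (GDP_df["GDP in 2009 dollars"]
        .pct_change()
        .apply(lambda x: f"{round((x * 100), 2)}%"))
    # Keep the rows from 2000q1 onward (the index is the default 0..n-1 here)
    GDP_df = GDP_df[year_2000:]
    return GDP_df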

Also, after the DataFrame has been sliced, the indices now start at 212. Is there a method to renumber the indices so they start at 0 or 1?

Upvotes: 3

Views: 244

Answers (2)

n1tk

Reputation: 2490

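As pointed out in the comments, you can use the query() method, which filters the rows of a DataFrame with a boolean expression passed as a string. Under the hood it uses the top-level pandas.eval() function, which evaluates a Python expression given as a string using one of several backends. Because the quarters are strings of the form YYYYqN, the comparison Quarter >= '2000q1' is lexicographic, and that ordering matches chronological order here. The reset_index(drop=True) call then renumbers the rows starting at 0.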

import pandas as pd

raw_data = {'ID':['101','101','101','102','102','102','102','103','103','103','103'],
            'Week':['08-02-2000','09-02-2000','11-02-2000','10-02-2000','09-02-2000','08-02-2000','07-02-2000','01-02-2000',
               '02-02-2000','03-02-2000','04-02-2000'],
            'Quarter':['2000q1','2000q2','2000q3','2000q4','2000q1','2000q2','2000q3','2000q4','2000q1','2000q2','2000q3'],
            'GDP in 2000 dollars':[15,15,10,15,15,5,10,10,15,20,11]}


def get_GDP_df():
    GDP_df = pd.DataFrame(raw_data).set_index('ID')
    print(GDP_df)  # for reference only: shows how the sample data is indexed (by ID)
    # Keep rows with Quarter >= '2000q1', then renumber the index from 0 (dropping the old ID index)
    GDP_df = GDP_df.query("Quarter >= '2000q1'").reset_index(drop=True)
    GDP_df["Growth"] = (GDP_df["GDP in 2000 dollars"]
        .pct_change()
        .apply(lambda x: f"{round((x * 100), 2)}%"))
    return GDP_df

get_GDP_df()

[Output tables not shown. Table 1: the sample data as read. Table 2: the final result after re-indexing.]
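For comparison, here is a minimal sketch showing that query() gives the same rows as an ordinary boolean mask. The toy values below are made up for illustration and are not taken from gdplev.xls; passing engine="python" is an optional choice made here only so the example does not depend on numexpr being installed.

```python
import pandas as pd

# Hypothetical toy data, not the real GDP figures.
df = pd.DataFrame({
    "Quarter": ["1999q3", "1999q4", "2000q1", "2000q2"],
    "GDP": [100.0, 101.0, 102.0, 103.0],
})

# String comparison is lexicographic, which matches chronological
# order for labels of the form YYYYqN.
via_query = (df.query("Quarter >= '2000q1'", engine="python")
               .reset_index(drop=True))

# The same filter written as a plain boolean mask.
via_mask = df[df["Quarter"] >= "2000q1"].reset_index(drop=True)

print(via_query)
print(via_query.equals(via_mask))
```

Note that both forms filter by value rather than by position, so they also keep any matching rows that appear before the first 2000q1 row if the data is not sorted by quarter.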

Upvotes: 1

Andy Hayden

Reputation: 375375

The following is equivalent to your current method (it replaces the lines after the read_excel call inside get_GDP_df):

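# idxmax() on a boolean Series returns the index label of the first True value
year_2000 = (GDP_df["Quarter"] == '2000q1').idxmax()
GDP_df["Growth"] = (GDP_df["GDP in 2009 dollars"]
  .pct_change()
  .mul(100)
  .round(2)
  .apply(lambda x: f"{x}%"))
# .loc slices by label, so this keeps every row from 2000q1 onward
return GDP_df.loc[year_2000:]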
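To also cover the renumbering part of the question, the idxmax() approach can be combined with reset_index(drop=True). This is a rough sketch on made-up data; the helper name rows_from and the toy values are hypothetical and only for illustration. It adds an explicit check because idxmax() returns the first label even when nothing matches.

```python
import pandas as pd

# Hypothetical toy data standing in for the GDP table.
gdp = pd.DataFrame({
    "Quarter": ["1999q3", "1999q4", "2000q1", "2000q2"],
    "GDP": [100.0, 110.0, 121.0, 133.1],
})

def rows_from(df, quarter):
    """Return the rows from the first occurrence of `quarter` onward,
    renumbered from 0. (Illustrative helper, not part of pandas.)"""
    matches = df["Quarter"] == quarter
    if not matches.any():
        # Without this check, idxmax() would silently return the first label.
        raise ValueError(f"{quarter!r} not found in the Quarter column")
    start = matches.idxmax()  # label of the first True
    return df.loc[start:].reset_index(drop=True)

result = rows_from(gdp, "2000q1")

# For an index that starts at 1 instead of 0, shift a copy by one.
one_based = result.copy()
one_based.index = one_based.index + 1

print(result)
print(one_based)
```

The drop=True argument discards the old index instead of keeping it as a new column.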

Upvotes: 1
