EB727
EB727

Reputation: 63

Creating a new column that can equal one of many columns depending on condition (Pandas)

I have a wide pandas dataframe with a number of variables with titles of the form 'crimeYR.' For example, crime1996 is a dummy variable that tells me if an observation has been convicted of a crime up until the year 1996, crime1998 tells me whether he/she has been convicted of a crime up until 1998, and so forth. Everyone in the dataset is born in different years and I would like to create a single variable that tells me whether or not a person has committed a crime by age 25. Here is an example of what I would like to create

birthYR  crime2006 crime2008 crime2010 crimeby25
1981         0         1         1         0
1981         1         1         1         1
1983         0         1         1         1
1982         0         0         1         0

I have a general idea of how one might code it in stata, but I am struggling to get it to work in python. Here's an idea of how it might work in stata:

gen crimeby25 = 0
foreach v of num 2006/2016{
     replace crimeby25 = crime`v' if `v' - birthyr == 25
}

What's a simple way of doing what I'm trying to do in Python?

Upvotes: 1

Views: 59

Answers (1)

plalanne
plalanne

Reputation: 1030

Here is a solution. You have the following dataframe :

df = pd.DataFrame({'birthYR': [1981,1981,1983,1982],
                   'crime2006': [0,1,0,0],
                   'crime2008': [1,1,1,0],
                   'crime2010':[1,1,1,1]})

df

birthYR  crime2006 crime2008 crime2010 crimeby25
1981         0         1         1         0
1981         1         1         1         1
1983         0         1         1         1
1982         0         0         1         0

Let's first define the list of years that we are studying :

years = [2006,2008,2010]

We create a few intermediary useful columns

for year in years :
    # Compute the age at a given year
    df["AgeIn"+str(year)] = year - df["birthYR"]

    # Is he/she more than 25 at a given year
    df["NotMoreThan25In"+str(year)] = df["AgeIn"+str(year)]<=25

    # Let's remove age column for clarity
    df = df.drop("AgeIn"+str(year),axis=1)

    # Check if he/she commited crime and was not more than 25 at a given year
    df["NotMoreThan25In"+str(year)+"AndCrime"] = df["NotMoreThan25In"+str(year)]*df["crime"+str(year)]

In the end, we just sum the total on each year to see if he/she commited crime by 25 :

df["crimeby25"] = df[["Not25In"+str(year)+"AndCrime" for year in years]].max(axis=1)

df["crimeby25"]

0    0
1    1
2    1
3    0

Upvotes: 1

Related Questions