cgp25
cgp25

Reputation: 345

Pandas Columns Operations with List

I have a pandas dataframe with two columns, the first one with just a single date ('action_date') and the second one with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill the df new columns with the number of dates in verification_date that have a difference of either over or under 360 days.

Here is my code:

df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()


def make_columns(df):
    df = df
    for i in range(len(df)):  
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i]-x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
return df

make_columns(df)

This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.

I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!

Upvotes: 1

Views: 3068

Answers (2)

Kacper Wolkowski
Kacper Wolkowski

Reputation: 1597

you might want to try this:

df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)

I believe it should be a bit faster. You didn't specify what to do if == 360, so you can just change > or < into >= or <=.

Upvotes: 1

Rayhane Mama
Rayhane Mama

Reputation: 2424

Your problem was that you were always updating the whole column with the value of the last calculation with these lines:

df['over_360'] = len(over_360)
df['under_360'] = len(under_360)

what you want to do instead is set the value for each line calculation accordingly, you can do this by replacing the above lines with these:

df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))

what it does is, it sets a value in line i and column over_360 or under_360.

you can learn more about it here.

If you don't like using set_values you can also use this:

df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)

you can check dataframe.ix here.

Upvotes: 1

Related Questions