Dsci

Reputation: 65

Perform a calculation/arithmetic operation on two fields within a pandas DataFrame constructor

I need to simulate some transactional data using numpy and pandas, similar to the code below:

import random
import numpy as np
import pandas as pd
n=1000

sample_df = pd.DataFrame({ 
'arrival_date':np.random.choice( pd.date_range('1/1/2015', periods=n, 
                      freq='D'), n),
'days_stay': [random.randint(1,14) for x in range(n)]
})

The dataframe needs to have 3 fields: the two calculated as above, PLUS another date field that is the sum of those two fields:

'departure_date': 'arrival_date' + 'days_stay'

The catch is that I would prefer to define all three fields within the pandas dataframe constructor, and not have to define a function for this last field and then refer to it in a second dataframe step to get the data.

sample_df = pd.DataFrame({ 
'arrival_date':np.random.choice( pd.date_range('1/1/2015', periods=n, 
                      freq='D'), n),
'days_stay': [random.randint(1,14) for x in range(n)],
'departure_date': 'arrival_date' + 'days_stay'
})

Is this possible?

Thanks in advance.

Upvotes: 3

Views: 615

Answers (3)

Dsci

Reputation: 65

@gyx-hh and @jpp

Thank you both for answering my questions. I at least have an answer as to whether this can be done within the dataframe constructor (it can't), and an alternative way to do this with the assign option.

The traditional option of adding the column in a second step works, but needed a little tweaking: the integer days_stay values have to be converted to timedeltas before they can be added to the dates.

sample_df['departure_date'] = (sample_df['arrival_date'] +
    pd.to_timedelta(sample_df['days_stay'], unit='D'))
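
For anyone who wants to check the second-step approach on something small, here is a minimal sketch. The two rows are hand-written illustrative values (not taken from the random data in the question), and demo_df is just a name made up for this example; pd.to_timedelta does the integer-to-timedelta conversion.

```python
import pandas as pd

# Small hand-written frame; the values are illustrative assumptions
demo_df = pd.DataFrame({
    'arrival_date': pd.to_datetime(['2015-01-01', '2015-01-30']),
    'days_stay': [3, 5],
})

# Convert the integer day counts to timedeltas, then add them to the dates
demo_df['departure_date'] = (
    demo_df['arrival_date'] + pd.to_timedelta(demo_df['days_stay'], unit='D')
)

print(demo_df)
```

The second row crosses a month boundary, which is the kind of case worth eyeballing when adding day counts to dates.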

Thanks again to you both! Marking as accepted.

Upvotes: 0

jpp

Reputation: 164653

The catch is that I would prefer to define all three fields within the pandas dataframe constructor

This isn't possible. You can't define a series from two other series which have not yet been created. You can use a pd.DataFrame.assign trick to add your series using method chaining in a subsequent step. But, in my opinion, there's nothing better than:

df['departure_date'] = df['arrival_date'] + pd.to_timedelta(df['days_stay'], unit='D')
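
If the real goal is to have all three keys appear inside the constructor call, one possible workaround (not something proposed in the answers here, just a sketch) is to build the two source arrays as ordinary variables first, so they already exist when the dict is evaluated. The names dates, arrival and stay and the seed 0 are assumptions made for this example, and it uses NumPy's Generator API rather than the random module from the question.

```python
import numpy as np
import pandas as pd

n = 1000
rng = np.random.default_rng(0)  # seeded generator; the seed is an arbitrary choice

# Build the two source columns first, as plain variables
dates = pd.date_range('1/1/2015', periods=n, freq='D')
arrival = dates[rng.integers(0, n, size=n)]  # sample n dates with replacement
stay = rng.integers(1, 15, size=n)           # integers from 1 to 14 inclusive

# All three columns are now defined inside the constructor,
# because the inputs to the third one already exist
sample_df = pd.DataFrame({
    'arrival_date': arrival,
    'days_stay': stay,
    'departure_date': arrival + pd.to_timedelta(stay, unit='D'),
})

print(sample_df.head())
```

This does not contradict the point above: the third column is still computed from objects created beforehand, it is only the order of the statements that changes.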

Upvotes: 1

gyx-hh

Reputation: 1431

Try the following. By calling assign on the result of pd.DataFrame(), we get to use the newly created df and its data to add a new column.

sample_df = pd.DataFrame({ 
'arrival_date':np.random.choice( pd.date_range('1/1/2015', periods=n, 
                      freq='D'), n),
'days_stay': [random.randint(1,14) for x in range(n)],
}).assign(
    departure_date=lambda df: df.arrival_date
        + df.days_stay.apply(lambda days: pd.Timedelta(str(days) + 'D'))
)

Sample Output:

    arrival_date   days_stay    departure_date
0   2015-02-17     3            2015-02-20
1   2015-01-18     13           2015-01-31
2   2015-02-12     6            2015-02-18
3   2015-01-15     14           2015-01-29
4   2015-03-11     5            2015-03-16
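
A possible variation on the same assign idea, sketched here under the assumption that days_stay holds whole numbers of days: pd.to_timedelta converts the entire column in one vectorized call, so the per-row apply is not needed. np.random.randint is swapped in for the list comprehension purely for brevity, and the lambda argument name d is arbitrary.

```python
import numpy as np
import pandas as pd

n = 1000

sample_df = pd.DataFrame({
    'arrival_date': np.random.choice(
        pd.date_range('1/1/2015', periods=n, freq='D'), n),
    'days_stay': np.random.randint(1, 15, size=n),  # 1 to 14 inclusive
}).assign(
    # d is the frame built by the constructor; convert the whole column at once
    departure_date=lambda d: d['arrival_date']
        + pd.to_timedelta(d['days_stay'], unit='D')
)

print(sample_df.head())
```

The data is random, so the printed rows will differ from run to run and from the sample output above.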

Upvotes: 2
