Reputation: 65
I need to simulate some transactional data using numpy and pandas, similar to the code below:
import random
import numpy as np
import pandas as pd

n = 1000
sample_df = pd.DataFrame({
    'arrival_date': np.random.choice(pd.date_range('1/1/2015', periods=n, freq='D'), n),
    'days_stay': [random.randint(1, 14) for _ in range(n)],
})
The dataframe needs three fields: the two generated as above, plus a third date field that is the sum of the other two:
'departure_date': 'arrival_date' + 'days_stay'
The catch is that I would prefer to define all three fields within the pandas dataframe constructor, and not have to define a function for this last field and then refer to it in a second dataframe step to get the data.
sample_df = pd.DataFrame({
    'arrival_date': np.random.choice(pd.date_range('1/1/2015', periods=n, freq='D'), n),
    'days_stay': [random.randint(1, 14) for _ in range(n)],
    'departure_date': 'arrival_date' + 'days_stay'
})
Is this possible?
Thanks in advance.
Upvotes: 3
Views: 615
Reputation: 65
@gyx-hh and @jpp
Thank you both for answering my questions. I at least have an answer as to whether this can be done within the dataframe constructor (it can't), and an alternative way to do this with the assign option.
The traditional option of adding the columns in a second step would seem to work, but needed a little tweaking.
sample_df['departure_date'] = sample_df['arrival_date'] + pd.to_timedelta(sample_df['days_stay'], unit='D')
Thanks again to you both! Marking as accepted.
Upvotes: 0
Reputation: 164653
The catch is that I would prefer to define all three fields within the pandas dataframe constructor
This isn't possible. You can't define a series from two other series that have not yet been created. You can use pd.DataFrame.assign to add the series via method chaining in a subsequent step, but in my opinion there's nothing better than:
df['departure_date'] = df['arrival_date'] + pd.to_timedelta(df['days_stay'], unit='D')
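If the main goal is a single constructor call with all three fields, one possible workaround is to build the two independent arrays first and pass the derived one alongside them. This is only a sketch: the names `arrival` and `stay`, the fixed seed, and the use of `np.random.default_rng` are assumptions for illustration, not part of the question.

```python
import numpy as np
import pandas as pd

n = 1000
rng = np.random.default_rng(0)  # seed chosen only to make the sketch reproducible

# Build the two independent columns as plain NumPy arrays first.
dates = pd.date_range('2015-01-01', periods=n, freq='D').to_numpy()
arrival = rng.choice(dates, size=n)
stay = rng.integers(1, 15, size=n)  # upper bound is exclusive, so values are 1..14

sample_df = pd.DataFrame({
    'arrival_date': arrival,
    'days_stay': stay,
    # Convert the integer day counts to timedeltas before adding them to the dates.
    'departure_date': arrival + pd.to_timedelta(stay, unit='D'),
})
```

The constructor still cannot refer to its own columns; the derived values are simply computed from the same arrays before the dataframe exists.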
Upvotes: 1
Reputation: 1431
Try the following. Calling assign on the result of pd.DataFrame() lets us use the newly created dataframe and its data to add a new column.
sample_df = pd.DataFrame({
    'arrival_date': np.random.choice(pd.date_range('1/1/2015', periods=n, freq='D'), n),
    'days_stay': [random.randint(1, 14) for _ in range(n)],
}).assign(
    departure_date=lambda df: df.arrival_date + df.days_stay.apply(lambda d: pd.Timedelta(str(d) + 'D'))
)
Sample Output:
  arrival_date  days_stay departure_date
0   2015-02-17          3     2015-02-20
1   2015-01-18         13     2015-01-31
2   2015-02-12          6     2015-02-18
3   2015-01-15         14     2015-01-29
4   2015-03-11          5     2015-03-16
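The per-row apply above can probably be replaced by a single vectorized conversion. A minimal sketch, assuming `pd.to_timedelta` with `unit='D'` and a small `n` picked only for illustration; the `gap_days` check is a hypothetical sanity test, not part of the original answer.

```python
import numpy as np
import pandas as pd

n = 5  # small size, just for illustration
sample_df = pd.DataFrame({
    'arrival_date': np.random.choice(pd.date_range('2015-01-01', periods=n, freq='D'), n),
    'days_stay': np.random.randint(1, 15, size=n),  # upper bound is exclusive: 1..14
}).assign(
    # pd.to_timedelta converts the whole integer column in one call
    departure_date=lambda d: d['arrival_date'] + pd.to_timedelta(d['days_stay'], unit='D')
)

# Sanity check: the gap between the two dates should equal days_stay.
gap_days = (sample_df['departure_date'] - sample_df['arrival_date']).dt.days
```

On larger frames this should avoid building one Timedelta object per row, though the result is meant to be the same as the apply version.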
Upvotes: 2