Reputation: 115
I'm working on a simulation problem where some data needs to be generated very quickly. The setup: given two datasets X and Y, I need a new dataset Z where each row of X populates new columns on a copy of Y; repeating this for every row of X and concatenating the results gives a final dataset Z with len(X) * len(Y) rows. What would be the most efficient way to create Z? Here's what I have so far:
import pandas as pd

X = pd.DataFrame({'simulationid': [0, 1, 2, 3], 'x1': [10, 20, 30, 40], 'x2': [1, 2, 3, 4]})
Y = pd.DataFrame({'timeofday': [5, 10, 15, 20], 'var1': [23.5, 45.0, 46.4, 60.3]})
The simplest option (probably the least efficient) would be to iterate over each row and populate a new dataframe:
# loop over every configuration, generate a dataset for each,
# then merge at the end
Z = pd.DataFrame()
cols2merge = ['x1', 'x2']
for index, row in X.iterrows():
    block = Y.copy()  # copy so Y itself isn't mutated on every pass
    for c in cols2merge:
        block[c] = row[c]
    Z = pd.concat([Z, block])
But this ends up taking a REALLY long time as X grows (1,000 to 10,000 rows). What would be a smarter way to do this, taking advantage of vectorized operations or other pandas-specific optimizations? I'm thinking there's an elegant one-line merge/concat/join solution, but I can't seem to figure it out.
I also tried itertuples instead of iterrows, as recommended here: https://github.com/pandas-dev/pandas/issues/10334, but didn't notice a significant improvement in execution time.
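For reference, a sketch of the itertuples variant I mean (I also pulled the concat out of the loop and accumulated blocks in a list, which often matters more than the choice of iterator; `block` and `parts` are my names):

```python
import pandas as pd

X = pd.DataFrame({'simulationid': [0, 1, 2, 3], 'x1': [10, 20, 30, 40], 'x2': [1, 2, 3, 4]})
Y = pd.DataFrame({'timeofday': [5, 10, 15, 20], 'var1': [23.5, 45.0, 46.4, 60.3]})

cols2merge = ['x1', 'x2']
parts = []
for row in X.itertuples(index=False):
    block = Y.copy()                  # fresh copy of Y for this configuration
    for c in cols2merge:
        block[c] = getattr(row, c)    # broadcast this row's value down the column
    parts.append(block)
Z = pd.concat(parts, ignore_index=True)  # single concat at the end
```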
Thanks in advance!
Upvotes: 4
Views: 145
Reputation: 59579
There are probably faster NumPy-based solutions, but you can just do one big cross merge via a dummy key:
cols2merge = ['x1','x2']
X[cols2merge].assign(dummy=1).merge(Y.assign(dummy=1), on='dummy').drop(columns='dummy')
x1 x2 timeofday var1
0 10 1 5 23.5
1 10 1 10 45.0
2 10 1 15 46.4
3 10 1 20 60.3
4 20 2 5 23.5
5 20 2 10 45.0
6 20 2 15 46.4
7 20 2 20 60.3
8 30 3 5 23.5
9 30 3 10 45.0
10 30 3 15 46.4
11 30 3 20 60.3
12 40 4 5 23.5
13 40 4 10 45.0
14 40 4 15 46.4
15 40 4 20 60.3
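Two variations on the same idea: pandas >= 1.2 supports `how='cross'` directly, which removes the dummy-key dance, and a NumPy-based sketch along the lines hinted at above using `np.repeat`/`np.tile` (the names `Z_cross` and `Z_np` are mine):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({'simulationid': [0, 1, 2, 3], 'x1': [10, 20, 30, 40], 'x2': [1, 2, 3, 4]})
Y = pd.DataFrame({'timeofday': [5, 10, 15, 20], 'var1': [23.5, 45.0, 46.4, 60.3]})
cols2merge = ['x1', 'x2']

# pandas >= 1.2: cross merge builds the Cartesian product directly
Z_cross = X[cols2merge].merge(Y, how='cross')

# NumPy: repeat each X row len(Y) times, tile Y len(X) times, then stitch
# the two frames together on their shared RangeIndex (note: tiling Y's
# mixed int/float values upcasts them all to float64)
Z_np = pd.DataFrame(
    np.repeat(X[cols2merge].to_numpy(), len(Y), axis=0),
    columns=cols2merge,
).join(pd.DataFrame(np.tile(Y.to_numpy(), (len(X), 1)), columns=Y.columns))
```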
Upvotes: 3