Reputation: 618
I have this dataframe.
>>> print(df)
   a  b  c  d  e
0  z  z  z  z  z
1  z  z  z  z  y
2  z  z  z  x  y
3  z  z  w  x  y
4  z  v  w  x  y
I also have a series of class labels (stored as a single-column dataframe).
>>> print(map_class)
    class
0      -1
1       0
2       1
3       2
4       3
5       4
6       5
7       6
8       7
9       8
10      9
11     10
My objective is to pair every row in the dataframe with every class value, i.e. a cross join. My desired output is as follows.
>>> print(result)
    a  b  c  d  e  class
0   z  z  z  z  z     -1
1   z  z  z  z  z      0
2   z  z  z  z  z      1
3   z  z  z  z  z      2
4   z  z  z  z  z      3
5   z  z  z  z  z      4
6   z  z  z  z  z      5
7   z  z  z  z  z      6
8   z  z  z  z  z      7
9   z  z  z  z  z      8
10  z  z  z  z  z      9
11  z  z  z  z  z     10
...
48  z  v  w  x  y     -1
49  z  v  w  x  y      0
50  z  v  w  x  y      1
51  z  v  w  x  y      2
52  z  v  w  x  y      3
53  z  v  w  x  y      4
54  z  v  w  x  y      5
55  z  v  w  x  y      6
56  z  v  w  x  y      7
57  z  v  w  x  y      8
58  z  v  w  x  y      9
59  z  v  w  x  y     10
Currently I am doing this with a for loop, but the performance is bad. Is there a way to do this without the loop? Here is my current code.
result = pd.DataFrame()
for i in range(len(df)):
    df_temp_multiple = pd.DataFrame()
    df_temp_single = df.iloc[i]
    df_temp_multiple = df_temp_multiple.append([df_temp_single] * len(map_class), ignore_index=True)
    df_temp_multiple = pd.concat([df_temp_multiple, map_class], axis=1)
    result = pd.concat([result, df_temp_multiple], ignore_index=True)
My real dataset is quite large, more than 10 GB, so performance really matters. Any suggestions will be much appreciated. Thanks!
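For completeness, here is a minimal snippet that reconstructs the sample data shown above:

```python
import pandas as pd

# Reconstruct the sample dataframe from the printed output above
df = pd.DataFrame(
    [list("zzzzz"),
     list("zzzzy"),
     list("zzzxy"),
     list("zzwxy"),
     list("zvwxy")],
    columns=list("abcde"),
)

# The class labels live in a single-column dataframe named 'class'
map_class = pd.DataFrame({"class": range(-1, 11)})

print(df.shape, map_class.shape)  # (5, 5) (12, 1)
```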
Upvotes: 3
Views: 1535
Reputation: 221534
Here's an approach that builds the output data with NumPy, using broadcasting to fill in the repeated row and class blocks, and finally constructs a dataframe from that output data -
import numpy as np
import pandas as pd

m, n, r = df.shape[0], map_class.shape[0], df.shape[1]

# Allocate one (n, r+1) output block per input row
out = np.empty((m, n, r + 1), dtype=object)

# Broadcast each dataframe row across all n class values
out[:, :, :r] = df.values[:, None, :]

# Broadcast the class column across all m rows
out[:, :, -1] = map_class.values[:, 0]

col_names = list(df.columns) + list(map_class.columns)
df_out = pd.DataFrame(out.reshape(m * n, -1), columns=col_names)
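As a quick sanity check, here is the approach run end-to-end on the sample data from the question (reconstructed inline), confirming it matches the desired output:

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    [list("zzzzz"), list("zzzzy"), list("zzzxy"), list("zzwxy"), list("zvwxy")],
    columns=list("abcde"),
)
map_class = pd.DataFrame({"class": range(-1, 11)})

m, n, r = df.shape[0], map_class.shape[0], df.shape[1]
out = np.empty((m, n, r + 1), dtype=object)
out[:, :, :r] = df.values[:, None, :]        # repeat each row n times
out[:, :, -1] = map_class.values[:, 0]       # repeat the class column m times
df_out = pd.DataFrame(out.reshape(m * n, -1),
                      columns=list(df.columns) + list(map_class.columns))

print(df_out.shape)              # (60, 6)
print(df_out.iloc[0].tolist())   # ['z', 'z', 'z', 'z', 'z', -1]
print(df_out.iloc[-1].tolist())  # ['z', 'v', 'w', 'x', 'y', 10]
```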
Runtime test -
# Loopy version from the question
In [50]: %timeit func0(df, map_class)
100 loops, best of 3: 16 ms per loop
# Proposed one in this post
In [51]: %timeit func1(df, map_class)
10000 loops, best of 3: 152 µs per loop
In [52]: 16000.0/152
Out[52]: 105.26315789473684
That's a 100x+ speedup on the sample data, and it should hopefully scale well for bigger datasets too.
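For reference, newer pandas versions (1.2+) can express the same cross join directly with `merge(how='cross')`; a minimal sketch, assuming the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame(
    [list("zzzzz"), list("zzzzy"), list("zzzxy"), list("zzwxy"), list("zvwxy")],
    columns=list("abcde"),
)
map_class = pd.DataFrame({"class": range(-1, 11)})

# how='cross' pairs every row of df with every row of map_class,
# in left-major order (all class values for row 0, then row 1, ...)
result = df.merge(map_class, how="cross")
print(result.shape)  # (60, 6)
```

This keeps the original dtypes (the NumPy route produces an all-object array), at the cost of requiring a recent pandas.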
Upvotes: 2