arnold

Reputation: 618

Python - Pandas: How to add a series to each row in a dataframe

I have this dataframe.

>>> print(df)
   a  b  c  d  e
0  z  z  z  z  z
1  z  z  z  z  y
2  z  z  z  x  y
3  z  z  w  x  y
4  z  v  w  x  y

I also have this series, stored as a single-column dataframe.

>>> print(map_class)
    class
0      -1
1       0
2       1
3       2
4       3
5       4
6       5
7       6
8       7
9       8
10      9
11     10

My objective is to attach the whole series to each row of the dataframe, so that every row is repeated once per class value. My desired output is as follows.

>>> print(result)
    a  b  c  d  e  class
0   z  z  z  z  z     -1
1   z  z  z  z  z      0
2   z  z  z  z  z      1
3   z  z  z  z  z      2
4   z  z  z  z  z      3
5   z  z  z  z  z      4
6   z  z  z  z  z      5
7   z  z  z  z  z      6
8   z  z  z  z  z      7
9   z  z  z  z  z      8
10  z  z  z  z  z      9
11  z  z  z  z  z     10
...
48  z  v  w  x  y     -1
49  z  v  w  x  y      0
50  z  v  w  x  y      1
51  z  v  w  x  y      2
52  z  v  w  x  y      3
53  z  v  w  x  y      4
54  z  v  w  x  y      5
55  z  v  w  x  y      6
56  z  v  w  x  y      7
57  z  v  w  x  y      8
58  z  v  w  x  y      9
59  z  v  w  x  y     10

Currently I am doing this with a for loop, but the performance is bad. Is there another way to do this without a for loop? Here is my current code.

import pandas as pd

result = pd.DataFrame()

for i in range(len(df)):
    df_temp_multiple = pd.DataFrame()

    # Repeat the current row once per entry in map_class
    df_temp_single = df.iloc[i]
    df_temp_multiple = df_temp_multiple.append([df_temp_single] * len(map_class), ignore_index=True)

    # Attach the class column and add the block to the result
    df_temp_multiple = pd.concat([df_temp_multiple, map_class], axis=1)
    result = pd.concat([result, df_temp_multiple], ignore_index=True)

My real dataset is quite large, more than 10 GB, so performance is really important. Any suggestions would be much appreciated. Thanks!

Upvotes: 3

Views: 1535

Answers (1)

Divakar

Reputation: 221534

Here's an approach that uses NumPy to create the output data, broadcasting the dataframe values alongside the class values into a 3D array, and finally constructs a dataframe from that output -

import numpy as np
import pandas as pd

m, n, r = df.shape[0], map_class.shape[0], df.shape[1]

# Broadcast each dataframe row against every class value into a 3D object array
out = np.empty((m, n, r + 1), dtype=object)
out[:, :, :r] = df.values[:, None, :]
out[:, :, -1] = map_class.values[:, 0]

# Collapse the first two axes back into rows and rebuild a dataframe
col_names = list(df.columns) + list(map_class.columns)
df_out = pd.DataFrame(out.reshape(m * n, -1), columns=col_names)
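
In the timing test below, func0 is the loopy version from the question and func1 is the approach above, each wrapped in a function. The timings presumably use the small sample inputs from the question, which can be reconstructed like so (a sketch, not from the original post) -

import pandas as pd

# Rebuild the small sample inputs shown in the question
df = pd.DataFrame({'a': list('zzzzz'),
                   'b': list('zzzzv'),
                   'c': list('zzzww'),
                   'd': list('zzxxx'),
                   'e': list('zyyyy')})
map_class = pd.DataFrame({'class': range(-1, 11)})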

Runtime test -

# Loopy version from the question
In [50]: %timeit func0(df, map_class)
100 loops, best of 3: 16 ms per loop

# Proposed one in this post
In [51]: %timeit func1(df, map_class)
10000 loops, best of 3: 152 µs per loop

In [52]: 16000.0/152
Out[52]: 105.26315789473684

That's a 100x+ speedup on the sample data, and it should hopefully scale well for bigger datasets as well.
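
Note that since out is created with dtype=object, every column of df_out, including class, comes back with object dtype. If an integer dtype is needed downstream, it can be cast back explicitly -

# Cast the appended column back to integers (the letter columns stay as object dtype)
df_out['class'] = df_out['class'].astype(int)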

Upvotes: 2
