Reputation: 751
I have a DataFrame as follows, where the number of columns can grow considerably. I am looking to create a new column based on the sum of each row:
| A | B  | C   | Total |
|---|----|-----|-------|
| x | 34 | 8   | 42    |
| y | 43 | 12  | 55    |
| z | 6  | 321 | 327   |
I know I can easily do: df['Total'] = df['B'] + df['C']
However, I am looking for a better technique, as the number of columns to sum can become much larger.
Upvotes: 1
Views: 4466
Reputation: 582
apply can be painfully slow for DataFrames with a large number of rows. Avoid it whenever possible. Here is a work-around:
cols_to_sum = [<columns to sum over>]
df['Total'] = df[cols_to_sum].sum(axis = 1)
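Applied to the example in the question, a minimal sketch would be (column names assumed as shown there):
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'z'],
                   'B': [34, 43, 6],
                   'C': [8, 12, 321]})

cols_to_sum = ['B', 'C']  # extend this list as more columns appear
df['Total'] = df[cols_to_sum].sum(axis=1)
print(df['Total'].tolist())  # [42, 55, 327]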
Here is a performance comparison between the two methods:
df = pd.DataFrame({"a" : np.random.randn(100000),
"b": np.random.randn(100000),
"c": np.random.randn(100000),
"d": np.random.randn(100000),
"e": np.random.randn(100000)})
cols_to_sum = list('abcde')
%%time
result1 = df[cols_to_sum].apply(np.sum, axis = 1)
>> CPU times: user 7.88 s, sys: 39.7 ms, total: 7.92 s
>> Wall time: 7.89 s
%%time
result2 = df[cols_to_sum].sum(axis = 1)
>> CPU times: user 9.51 ms, sys: 0 ns, total: 9.51 ms
>> Wall time: 17.5 ms
print((result1 == result2).all())
>> True
That's ~ 400x speed-up for you.
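If you are not in an IPython/Jupyter session (the %%time magic only works there), a rough equivalent sketch using the standard-library timeit module would be the following; the exact numbers will of course vary by machine:
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 5), columns=list('abcde'))
cols_to_sum = list('abcde')

# Row-wise apply calls np.sum once per row from Python
t_apply = timeit.timeit(lambda: df[cols_to_sum].apply(np.sum, axis=1), number=1)

# Vectorised sum is a single call into pandas/NumPy internals
t_sum = timeit.timeit(lambda: df[cols_to_sum].sum(axis=1), number=10) / 10

print(f"apply: {t_apply:.3f} s, sum: {t_sum:.4f} s, speed-up: ~{t_apply / t_sum:.0f}x")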
Upvotes: 1
Reputation: 4618
You can use apply along the rows (axis=1) like this:
df['Total'] = df.apply(np.sum, axis=1)
If you want to skip the first column, you can use .iloc with a positional slice:
df['Total'] = df.iloc[:, 1:].apply(np.sum, axis=1)
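If the non-numeric columns are not guaranteed to sit at the start, a more robust variant is to select only the numeric columns before summing; a small sketch using the question's data and pandas' select_dtypes:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'z'],
                   'B': [34, 43, 6],
                   'C': [8, 12, 321]})

# Sum only the numeric columns, regardless of their position
df['Total'] = df.select_dtypes(include=np.number).sum(axis=1)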
Upvotes: 3