Reputation: 484
I have a pandas DataFrame of strings, as shown below.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(97, 123, size=(3, 4), dtype=np.uint8).view('S1'),
                  columns=list('ABCD'))
df
Out:
   A  B  C  D
0  q  g  v  f
1  l  m  u  u
2  r  r  j  w
I also have a list of column names.
col_list = [['A'], ['A', 'B'], ['A', 'B', 'C']]
I want to slice df and apply an operation as follows:
df[col_list[1]].values.sum(axis=1)
Out:
array(['qg', 'lm', 'rr'], dtype=object)
Similarly, I need to do this operation for all items in col_list. I can do this in a for loop, but that will be slow for a large list. Is there any way to vectorize this, so that I can pass col_list as a numpy array and get the result as a 2D numpy array of shape (len(col_list), len(df.index))? The point is, it needs to be fast for a large list.
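For reference, the plain for-loop version I want to avoid would look something like this (res is just an illustrative name):
res = [df[cols].values.sum(axis=1) for cols in col_list]  # one pandas indexing call per sublist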
Upvotes: 2
Views: 291
Reputation: 29742
Using numpy with r_, cumsum, and hsplit:
import numpy as np
# Select the columns of every sublist in one flattened .loc call, split the
# resulting block back into per-sublist chunks, then sum (i.e. concatenate
# the strings) across each chunk's columns.
arr_list = np.hsplit(df.loc[:, np.r_[[i for l in col_list for i in l]]].values,
                     np.cumsum(list(map(len, col_list))))
res1 = list(map(lambda x: np.sum(x, 1), arr_list))[:-1]  # last chunk is empty, drop it
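To make the mechanics concrete, this is roughly what the intermediate pieces look like for the three-item col_list above (a sketch; flat, splits, block, chunks, and sums are illustrative names):
flat = [c for cols in col_list for c in cols]         # ['A', 'A', 'B', 'A', 'B', 'C']
splits = np.cumsum([len(cols) for cols in col_list])  # array([1, 3, 6])
block = df.loc[:, flat].values                        # shape (3, 6): columns repeated per sublist
chunks = np.hsplit(block, splits)                     # 4 chunks; the last is an empty (3, 0) block
sums = [c.sum(axis=1) for c in chunks[:-1]]           # one concatenated-string array per sublist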
This is about 60x faster than a plain for loop when col_list contains 3000 lists:
col_list = [['A'], ['A', 'B'], ['A', 'B', 'C']] * 1000
numpy:
%%timeit
arr_list = np.hsplit(df.loc[:, np.r_[[i for l in col_list for i in l]]].values,
                     np.cumsum(list(map(len, col_list))))
res1 = list(map(lambda x: np.sum(x, 1), arr_list))[:-1]
# 24.3 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
for loop:
%%timeit
res2 = []
for l in col_list:
    res2.append(df[l].values.sum(axis=1))
# 1.53 s ± 62.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Validation:
all(all(i == j) for i, j in zip(res1, res2))
# True
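Since every per-sublist result has length len(df.index), res1 can also be stacked into the 2D array of shape (len(col_list), len(df.index)) that the question asks for, e.g.:
res_2d = np.vstack(res1)  # shape (len(col_list), len(df.index))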
Upvotes: 2