Ivan Stimac
Ivan Stimac

Reputation: 59

dataframe.str[start:stop] where start and stop are columns in same data frame

I would like to use pandas.str to vectorize slice operation on pandas column which values are list and start and stop values are int values in start and stop columns of same dataframe example:

df['column_with_list_values'].str[start:stop]

df[['list_values', 'start', 'stop']]
                list_values  start  stop
0              [5, 7, 6, 8]           0               2
1  [1, 3, 5, 7, 2, 4, 6, 8]           1               3
2  [1, 3, 5, 7, 2, 4, 6, 8]           0               2
3  [1, 3, 5, 7, 2, 4, 6, 8]           0               2
4  [1, 3, 5, 7, 2, 4, 6, 8]           1               3
5  [1, 3, 5, 7, 2, 4, 6, 8]           2               4
6  [1, 3, 5, 7, 2, 4, 6, 8]           0               2

and result would be
    0    [5, 7]
    1    [3, 5]
    2    [1, 3]
    3    [1, 3]
    4    [3, 5]
    5    [5, 7]
    6    [1, 3]

Thanks!

Upvotes: 1

Views: 377

Answers (2)

BeRT2me
BeRT2me

Reputation: 13242

df.apply(lambda x: x.list_values[x.start:x.stop], axis=1)

Output:

0    [5, 7]
1    [3, 5]
2    [1, 3]
3    [1, 3]
4    [3, 5]
5    [5, 7]
6    [1, 3]
dtype: object

I'm not sure why, but the fastest variation appears to be:

df['sliced'] = [lst[start:stop] for lst, start, stop in zip(df.list_values.tolist(), df.start.tolist(), df.stop.tolist())]

My testing:

df = pd.DataFrame({'list_values': {0: [5, 7, 6, 8], 1: [1, 3, 5, 7, 2, 4, 6, 8], 2: [1, 3, 5, 7, 2, 4, 6, 8], 3: [1, 3, 5, 7, 2, 4, 6, 8], 4: [1, 3, 5, 7, 2, 4, 6, 8], 5: [1, 3, 5, 7, 2, 4, 6, 8], 6: [1, 3, 5, 7, 2, 4, 6, 8]}, 'start': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0}, 'stop': {0: 2, 1: 3, 2: 2, 3: 2, 4: 3, 5: 4, 6: 2}})
df = pd.concat([df]*100000)
# Shape is now (700000, 3)

def v1(df):
    temp = df.copy()
    temp['sliced'] = [lst[start:stop] for lst, start, stop in temp.values.tolist()]

def v2(df):
    temp = df.copy()
    temp['sliced'] = [lst[start:stop] for lst, start, stop in zip(temp.list_values, temp.start, temp.stop)]

def v3(df):
    temp = df.copy()
    temp['sliced'] = [lst[start:stop] for lst, start, stop in temp.values]

def v4(df):
    temp = df.copy()
    temp['sliced'] = [lst[start:stop] for lst, start, stop in zip(df.list_values.tolist(), df.start.tolist(), df.stop.tolist())]

def v5(df):
    temp = df.copy()
    temp['sliced'] = temp.apply(lambda x: x.list_values[x.start:x.stop], axis=1)

%timeit -n 10 v1(df)
%timeit -n 10 v2(df)
%timeit -n 10 v3(df)
%timeit -n 10 v4(df)
%timeit v5(df)

Output:

# v1: temp.values.tolist()
235 ms ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# v2: zip(temp.list_values, temp.start, temp.stop)
249 ms ± 9.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# v3: temp.values
578 ms ± 6.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# v4: zip(df.list_values.tolist(), df.start.tolist(), df.stop.tolist())
149 ms ± 8.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# v5: apply
12.1 s ± 165 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But yes, the list comprehension method, no matter what variation, is significantly faster than using apply.


Third update:

I figured out how to sort of vectorize this problem, using groupby and transform. Still not quite as good as the best list comprehension in my testing, but pretty darn good.

def v6(df):
    temp = df.copy()
    temp['sliced'] = temp.groupby(['start','stop'])['list_values'].transform(lambda x: x.str[x.name[0]:x.name[1]])

# v6: groupby
256 ms ± 5.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 2

Rodalm
Rodalm

Reputation: 5503

You can try a list comprehension. It should be faster than using DataFrame.apply for big datasets, since DataFrame.apply introduces a big overhead.

df['sliced'] = [lst[start:stop] for lst, start, stop in zip(df.list_values, df.start, df.stop)]

Output:

>>> df

                list_values  start  stop  sliced
0              [5, 7, 6, 8]      0     2  [5, 7]
1  [1, 3, 5, 7, 2, 4, 6, 8]      1     3  [3, 5]
2  [1, 3, 5, 7, 2, 4, 6, 8]      0     2  [1, 3]
3  [1, 3, 5, 7, 2, 4, 6, 8]      0     2  [1, 3]
4  [1, 3, 5, 7, 2, 4, 6, 8]      1     3  [3, 5]
5  [1, 3, 5, 7, 2, 4, 6, 8]      2     4  [5, 7]
6  [1, 3, 5, 7, 2, 4, 6, 8]      0     2  [1, 3]

Look at BeRT2me benchmark for possible further improvements.

Upvotes: 2

Related Questions