Reputation: 1857
I want to speed up splitting a list. I have a working approach, but it is slower than I expected.
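import time

def split_list(lines):
    # Split every string on '-' and flatten the pieces into one list.
    return [x for xs in lines for x in xs.split('-')]

lst = []
for i in range(1000000):
    lst.append('320000-320000')

# Time the call with a high-resolution clock.
start = time.perf_counter()
lst_new = split_list(lst)
end = time.perf_counter()
print('time\n', str(end - start))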
For example, input (lst):
['320000-320000', '320000-320000']
Output (lst_new):
['320000', '320000', '320000', '320000']
I'm not satisfied with the splitting speed, as my data contains many lists, and I don't know whether there is a more efficient way to do it.
Following the advice I received, here is a more specific description of my whole problem.
import pandas as pd
df = pd.DataFrame({ 'line':["320000-320000, 340000-320000, 320000-340000",
"380000-320000",
"380000-320000,380000-310000",
"370000-320000,370000-320000,320000-320000",
"320000-320000, 340000-320000, 320000-340000",
"380000-320000",
"380000-320000,380000-310000",
"370000-320000,370000-320000,320000-320000",
"320000-320000, 340000-320000, 320000-340000",
"380000-320000",
"380000-320000,380000-310000",
"370000-320000,370000-320000,320000-320000"], 'id':[1,2,3,4,5,6,7,8,9,10,11,12],})
def most_common(lst):
    return max(set(lst), key=lst.count)
def split_list(lines):
    return [x for xs in lines for x in xs.split('-')]
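df['line'] = df['line'].str.split(',')
col_ix = df['line'].index.values
df['line_start'] = pd.Series(0, index=df.index)
df['line_destination'] = pd.Series(0, index=df.index)
import time
start = time.perf_counter()
for ix in col_ix:
    col = df['line'][ix]
    # Some entries carry a leading space after the comma; strip it so
    # that equal codes compare equal when counting.
    col_split = [part.strip() for part in split_list(col)]
    # Even positions are starting points, odd positions are destinations.
    even_col_split = col_split[0::2]
    even_col_split_most = most_common(even_col_split)
    # df.at writes to the frame itself rather than to a temporary copy;
    # int() keeps the integer dtype the column was created with.
    df.at[ix, 'line_start'] = int(even_col_split_most)
    odd_col_split = col_split[1::2]
    odd_col_split_most = most_common(odd_col_split)
    df.at[ix, 'line_destination'] = int(odd_col_split_most)
end = time.perf_counter()
print('time\n', str(end - start))
del df['line']
print('df\n', df)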
Input (df):
    id  line
0    1  320000-320000, 340000-320000, 320000-340000
1    2  380000-320000
2    3  380000-320000,380000-310000
3    4  370000-320000,370000-320000,320000-320000
4    5  320000-320000, 340000-320000, 320000-340000
5    6  380000-320000
6    7  380000-320000,380000-310000
7    8  370000-320000,370000-320000,320000-320000
8    9  320000-320000, 340000-320000, 320000-340000
9   10  380000-320000
10  11  380000-320000,380000-310000
11  12  370000-320000,370000-320000,320000-320000
Output (df):
    id  line_start  line_destination
0    1      320000            320000
1    2      380000            320000
2    3      380000            320000
3    4      370000            320000
4    5      320000            320000
5    6      380000            320000
6    7      380000            320000
7    8      370000            320000
8    9      320000            320000
9   10      380000            320000
10  11      380000            320000
11  12      370000            320000
You can think of each pair of numbers in line (e.g. 320000-320000) as the starting point and the destination of a route.
Expected: make the code run faster. (The current speed is unbearable.)
Upvotes: 2
Views: 603
Reputation: 53029
'-'.join(lst).split('-')
seems quite a bit faster:
>>> timeit("'-'.join(lst).split('-')", globals=globals(), number=10)
1.0838123590219766
>>> timeit("[x for xs in lst for x in xs.split('-')]", globals=globals(), number=10)
3.1370303670410067
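The comparison above can be reproduced with a self-contained script along the following lines. This is only a sketch: the helper names `split_comprehension` and `split_join` are made up for illustration, the list is smaller than the one in the question, and absolute timings will differ by machine. It also shows the one input where the two approaches disagree, an empty list.

```python
from timeit import timeit

def split_comprehension(lines):
    """Baseline from the question: split each string, then flatten."""
    return [x for xs in lines for x in xs.split('-')]

def split_join(lines):
    """Join everything with the delimiter, then split the result once."""
    return '-'.join(lines).split('-')

# Smaller than the question's list so the script finishes quickly.
lst = ['320000-320000'] * 100_000

# Both approaches give the same result for non-empty input.
assert split_join(lst) == split_comprehension(lst)

# Edge case: ''.split('-') returns [''], not [].
print(split_comprehension([]))  # []
print(split_join([]))           # ['']

print('comprehension:', timeit(lambda: split_comprehension(lst), number=10))
print('join + split: ', timeit(lambda: split_join(lst), number=10))
```

If the input may be empty, a guard such as `if not lines: return []` in front of the join keeps the two versions interchangeable.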
Upvotes: 3
Reputation: 531430
Pushing more of the work below the Python level seems to provide a small speedup:
In [7]: %timeit x = split_list(lst)
407 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %timeit x = list(chain.from_iterable(map(methodcaller("split", "-"), lst)))
374 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
methodcaller creates a function that calls the named method on its argument:
methodcaller("split", "-")(x) == x.split("-")
chain.from_iterable creates a single iterator consisting of the elements from a group of iterables:
list(chain.from_iterable([[1,2], [3,4]])) == [1,2,3,4]
Mapping the function returned by methodcaller onto your list of strings produces an iterable of lists suitable for flattening by from_iterable. The benefit of this more functional approach is that the functions involved are all implemented in C and operate directly on the data inside the Python objects, instead of going through Python bytecode that manipulates those objects.
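Put together as a standalone function, the pieces described above might look like this. It is a minimal sketch: the name `split_list_functional` and its `sep` parameter are invented here for illustration and do not appear in the timing session.

```python
from itertools import chain
from operator import methodcaller

def split_list_functional(lines, sep='-'):
    """Split every string on `sep` and flatten the pieces into one list."""
    # split(s) is equivalent to s.split(sep)
    split = methodcaller('split', sep)
    # map() yields one list per input string; chain.from_iterable flattens them.
    return list(chain.from_iterable(map(split, lines)))

lst = ['320000-320000', '340000-320000']
print(split_list_functional(lst))
# ['320000', '320000', '340000', '320000']
```

Unlike a join-then-split approach, this returns an empty list for empty input, the same as the list comprehension in the question.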
Upvotes: 1
Reputation: 22314
Depending on what you want to do with your list, using a generator can be slightly faster.
If you need to keep the output stored, then the list solution is faster.
If all you need is to iterate over the words once, you can get rid of some overhead by using a generator.
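def split_list(lines):
    # List version from the question, repeated so the snippet runs on its own.
    return [x for xs in lines for x in xs.split('-')]

def split_list_gen(lines):
    for line in lines:
        yield from line.split('-')

import time
lst = ['32000-32000'] * 10000000
start = time.perf_counter()
for x in split_list(lst):
    pass
end = time.perf_counter()
print('list time:', str(end - start))
start = time.perf_counter()
for y in split_list_gen(lst):
    pass
end = time.perf_counter()
print('generator time:', str(end - start))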
The generator solution is consistently about 10% faster.
list time: 0.4568295369982612
generator time: 0.4020671741918084
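One situation where a single pass is enough is counting how often each part occurs. The sketch below feeds the generator straight into `collections.Counter`; counting is an assumed use case chosen for illustration, not something measured above.

```python
from collections import Counter

def split_list_gen(lines):
    """Yield the '-'-separated parts of each string, one at a time."""
    for line in lines:
        yield from line.split('-')

lst = ['320000-320000', '340000-320000']

# Counter consumes the generator once; no intermediate list is built.
counts = Counter(split_list_gen(lst))
print(counts.most_common(1))  # [('320000', 3)]
```

If the parts are needed again after the first pass, the generator is exhausted and the list version is the better fit.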
Upvotes: 2