hgen
hgen

Reputation: 71

Length of all string in the list: the fastest way

I'm trying:

python3 -m timeit -c 'len("".join([str(x) for x in range(0, 999999)]))'
10 loops, best of 3: 330 msec per loop

python3 -m timeit -c 'sum((len(y) for y in [str(x) for x in range(0, 999999)]))
10 loops, best of 3: 439 msec per loop

Why does this happen? Is there a faster way?

P.S. It is assumed that a list of strings will be in advance.

Upvotes: 2

Views: 2329

Answers (4)

IMVinayak3
IMVinayak3

Reputation: 1

We could identify a pattern to calculate the sum of the lengths of the numbers from 0 to n.

Like take for example 999999

Number of 1-digit numbers = 10 (includes 0)

Number of 2-digit numbers = 90 = 99 - 10 + 1 (99 is the largest 2-digit number, and 10 is the smallest)

Number of 3-digit numbers = 900 = 999 - 100 + 1 ...

Similarly, we can calculate the number of numbers with a specific number of digits.

Using this we can multiply the number of digits by the number of numbers with those many digits to get the sum of their lengths.

As for the last case where the upper bound may not be equal to the largest number with the same number of digits. To account for that we can just add it separately.

So now we have reduced the problem to calculating the number of numbers.

n = 999999 # This could be any number

# Calculate the number of digits in n
d = len(str(n))

# Looping from 1 to d - 1 as all numbers will have to be included before 10^d
c = sum((i * 9 * 10 ** (i - 1) for i in range(1, d))) + 1

# Adding lengths of all d-digits number smaller than n (excluding)
c += (n - 10 ** (d - 1)) * d
print(c)

We get the output as 5888884 in 2.455611594719812e-06 seconds

For n = 123456789, we get the output 999999991 in 3.338770055794157e-06 seconds

Upvotes: 0

poke
poke

Reputation: 387707

Ignoring that rather small time difference for now, there is actually a huge difference for your two ways in memory.

sum((len(y) for y in [str(x) for x in range(0, 999999)]))

This will create a string for each number and store that in a list. Then you use a generator expression to loop over that list and sum the lengths up. So you essentially have a string for each number, a list storing all strings, and a number that is being added to for the lengths.

len(''.join([str(x) for x in range(0, 999999)]))

This will again create a string for each number and store that in a list. Then you create a huge string with all numbers. Afterwards you call length on in (which is then a O(1) call). So you don’t have the number you add to (while summing the lengths up), but you do have another long string that combines all the other strings again.

So even if that is faster, you are throwing away a lot of memory, which will likely have an effect on performance later too.

To improve all this, you should consider creating as little stuff permanently as possible. Don’t use list comprehensions as that will actually create the lists; don’t use str.join as that requires a list and iterates it twice.

sum(len(str(x)) for x in range(0, 999999)))

Now, this will still be slower than the len(''.join(…)) method but won’t have that much of a memory overhead. In fact, it will only create one string object at a time, get its length and add it to the sum. The string can then be immediately collected.

The reason this will still be slow though is that it both len and str need to be looked up with every iteration inside of the generator. To speed that up, use map to only look it up twice. wim made a really good suggestion in the comments:

sum(map(len, map(str, range(999999))))

This actually performs faster than the len(''.join(…)) way for me. My timing results in order of being mentioned in my answer:

62.36836282166257
50.54277449168785
58.24419845897603
40.3403849521618

Upvotes: 7

Fred Foo
Fred Foo

Reputation: 363607

A better benchmark with IPython shows the situation is worse than you thought:

>>> lst = [str(x) for x in range(0, 999999)]
>>> %timeit len("".join(lst))
100 loops, best of 3: 9.94 ms per loop
>>> %timeit sum(len(x) for x in lst)
10 loops, best of 3: 62.2 ms per loop

You're seeing two effects here, the overhead of function calls in Python and the overhead of its iteration. "".join doesn't have either because it's a single method call that does a loop in C. Intermediate performance with less memory use can be gotten from map:

>>> %timeit sum(map(len, lst))
10 loops, best of 3: 29.4 ms per loop

Upvotes: 3

Daniel
Daniel

Reputation: 27589

The first (faster) version has 1 call to the len function, 1 call to join and 100k calls to str. Looking at the second line you can see that both len and str are called 100k times each which makes for about twice as many total function calls in the second case.

Upvotes: 2

Related Questions