Dimgold
Dimgold

Reputation: 2944

efficient string char elimination in python

I have a python for loop over a string, that have to decide char-by-char if the current char should stay or be removed.

Let's say I have a function fun(character,state) which receives a char, and some state parameter, and returns True if the char should be kept, and a new state. for example:

def fun(c,state):
    if state%5==0:
        return False, state+1
    else:
        return True, state+2

I have two ideas how to iterate (sequentially) over the string with this function, but I can't decide between them (or others) in terms of runtime complexity and space usage tradeoff:

Option 1:

def option1(string):
    state = 0
    i =0
    while i<len(string):
        answer, state = fun(string[i],state)
        if not answer:
             string = string[:i]+string[i+1:]
        else:
            i+=1
    return string

Option 2:

def option2(string):
    result  = ''
    state = 0
    for c in string:
        answer, state = fun(c,state)
        if answer:
             result+=c
   return result

Does anyone has a better approach/ solution?


edit: some timing results:

on a long string ( 67976 chars) the results are:

looks like eliminating the i'th char with string[:i]+string[i+1:] is not a good idea.

Upvotes: 0

Views: 107

Answers (2)

pvg
pvg

Reputation: 2729

In CPython, a good performance rule of thumb is 'avoid function calls in tight loops'. The overhead tends to take a significant amount of time compared to simple iterative computations. In your case, a 'brute force' approach is to simply inline the logic in the loop that traverses your string:

def option4(strng):
    result  = []
    state = 0
    for c in strng:
        if state % 5 == 0:
            state += 1
        else:
            result.append(c)
            state += 2
    return ''.join(result)

This may feel a little clunky, a more 'pythonic' approach might be to use a python generator that takes an iterable and initial state as input and yields the matching members.

def filtergen(iter, initstate):
    state = initstate
    for c in iter:
        if state % 5 == 0:
            state += 1
        else:
            state += 2
            yield c

def option5(strng):
    return ''.join(filtergen(strng, 0))

This reduces call overhead - you only need as many calls as matching elements and as a side benefit, frees the caller from the need to keep track of state. Additionally, you no longer need the intermediate list. The filter remains generic - it works on any kind of iterable, rather than just strings.

Putting all of these together and measuring performance:

s = 'T0kzcZ0x8VQY8ulgFKU8D1MlIUULRsVsNZMnXbjliUEES6sEIVUzpjxlHLG59z' * 1200

def fun(c, state):
    if state % 5 == 0:
        return False, state + 1
    else:
        return True, state + 2

def option3(strng):
    result  = []
    state = 0
    for c in strng:
        answer, state = fun(c, state)
        if answer:
             result.append(c)
    return ''.join(result)

def option4(strng):
    result  = []
    state = 0
    for c in strng:
        if state % 5 == 0:
            state += 1
        else:
            result.append(c)
            state += 2
    return ''.join(result)

def filtergen(iter, initstate):
    state = initstate
    for c in iter:
        if state % 5 == 0:
            state += 1
        else:
            state += 2
            yield c

def option5(strng):
    return ''.join(filtergen(strng, 0))

import timeit
print("string length is " + str(len(s)))
print(timeit.timeit('option3(s)', globals=globals(), number=100))
print(timeit.timeit('option4(s)', globals=globals(), number=100))
print(timeit.timeit('option5(s)', globals=globals(), number=100))

print(option3(s) == option(4) == option5(s))

On my dinky laptop, the output looks like this (times are in seconds):

pvg /tmp ➤  python3 s.py
string length is 74400
2.488818967998668
1.3291878789968905
1.268602224998176
True

The 'inline logic' and generator both easily outperform the naive call-per-element approach.

Upvotes: 1

Mike M&#252;ller
Mike M&#252;ller

Reputation: 85482

String concatenation with += is not recommended. Better use a list and join it.

def option2(string):
    result  = []
    state = 0
    for c in string:
        answer, state = fun(c, state)
        if answer:
             result.append(c)
   return ''.join(result)

This a good explanation why. While += is optimized in CPython (for certain cases?), it can be really slow in other implementations such as PyPy, for example.

Upvotes: 3

Related Questions