mchangun
mchangun

Reputation: 10322

Characters in Strings in Python

Implement a function with signature find_chars(string1, string2) that takes two strings and returns a string that contains only the characters found in string1 and string two in the order that they are found in string1. Implement a version of order N*N and one of order N.

(Source: http://thereq.com/q/138/python-software-interview-question/characters-in-strings)

Here are my solutions:

Order N*N:

def find_chars_slow(string1, string2):
    res = []
    for char in string1:
        if char in string2:
            res.append(char)

    return ''.join(res)

So the for loop goes through N elements, and each char in string2 check is another N operations so this gives N*N.

Order N:

from collections import defaultdict

def find_char_fast(string1, string2):
    d = defaultdict(int)

    for char in string2:
        d[char] += 1

    res = []

    for char in string1:
        if char in d:
            res.append(char)

    return ''.join(res)

First store the characters of string2 as a dictionary (O(N)). Then scan string1 (O(N)) and check if it is in the dict (O(1)). This gives a total runtime of O(2N) = O(N).

Is the above correct? Is there a faster method?

Upvotes: 2

Views: 1771

Answers (4)

James Sapam
James Sapam

Reputation: 16940

Hi, I am trying to profiling the various solution given here:

In my snippet, I am using a module called faker to generate fake words so i can test on very long string more than 20k characters:

Snippet:

from faker import Faker
from timeit import Timer
from collections import defaultdict

def first(string1, string2):
    sets = set(string2)
    return ''.join((c for c in string1 if c in sets))

def second(s1, s2):
    res = []
    for char in string1:
        if char in string2:
            res.append(char)
    return ''.join(res)

def third(s1, s2):
    d = defaultdict(int)
    for char in string2:
        d[char] += 1
    res = []
    for char in string1:
        if char in d:
            res.append(char)
    return ''.join(res)

f = Faker()
string1 = ''.join(f.paragraph(nb_sentences=10000).split())
string2  = ''.join(f.paragraph(nb_sentences=10000).split())

funcs = [first, second, third]

import cProfile

print 'Length of String1: ', len(string1)
print 'Length of String2: ', len(string2)

print 'Time taken to execute:'
for f in funcs:
    t = Timer(lambda: f(string1, string2))
    print f.__name__, cProfile.run('t.timeit(number=100)')

Output:

Length of String1:  525133
Length of String2:  501050
Time taken to execute:
first          52513711 function calls in 18.169 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
      100    0.001    0.000   18.164    0.182 s.py:39(<lambda>)
      100    1.723    0.017   18.163    0.182 s.py:5(first)
 52513400    9.442    0.000    9.442    0.000 s.py:7(<genexpr>)
        1    0.000    0.000    0.000    0.000 timeit.py:143(setup)
        1    0.000    0.000   18.169   18.169 timeit.py:178(timeit)
        1    0.005    0.005   18.169   18.169 timeit.py:96(inner)
        1    0.000    0.000    0.000    0.000 {gc.disable}
        1    0.000    0.000    0.000    0.000 {gc.enable}
        1    0.000    0.000    0.000    0.000 {gc.isenabled}
        1    0.000    0.000    0.000    0.000 {globals}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      100    6.998    0.070   16.440    0.164 {method 'join' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {time.time}


None
second          52513611 function calls in 22.280 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   22.280   22.280 <string>:1(<module>)
      100    0.121    0.001   22.275    0.223 s.py:39(<lambda>)
      100   16.957    0.170   22.153    0.222 s.py:9(second)
        1    0.000    0.000    0.000    0.000 timeit.py:143(setup)
        1    0.000    0.000   22.280   22.280 timeit.py:178(timeit)
        1    0.005    0.005   22.280   22.280 timeit.py:96(inner)
        1    0.000    0.000    0.000    0.000 {gc.disable}
        1    0.000    0.000    0.000    0.000 {gc.enable}
        1    0.000    0.000    0.000    0.000 {gc.isenabled}
        1    0.000    0.000    0.000    0.000 {globals}
 52513300    4.018    0.000    4.018    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      100    1.179    0.012    1.179    0.012 {method 'join' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {time.time}


None
third          52513611 function calls in 28.184 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   28.184   28.184 <string>:1(<module>)
      100   22.847    0.228   28.059    0.281 s.py:16(third)
      100    0.120    0.001   28.179    0.282 s.py:39(<lambda>)
        1    0.000    0.000    0.000    0.000 timeit.py:143(setup)
        1    0.000    0.000   28.184   28.184 timeit.py:178(timeit)
        1    0.005    0.005   28.184   28.184 timeit.py:96(inner)
        1    0.000    0.000    0.000    0.000 {gc.disable}
        1    0.000    0.000    0.000    0.000 {gc.enable}
        1    0.000    0.000    0.000    0.000 {gc.isenabled}
        1    0.000    0.000    0.000    0.000 {globals}
 52513300    4.032    0.000    4.032    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      100    1.180    0.012    1.180    0.012 {method 'join' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {time.time}


None

Conclusion:

So, the first function with comprehension is the fastest.

But when you run on strings size around 25K characters second functions wins.

Length of String1:  22959
Length of String2:  452919
Time taken to execute:
first          2296311 function calls in 2.216 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
      100    0.000    0.000    2.216    0.022 s.py:39(<lambda>)
      100    1.530    0.015    2.216    0.022 s.py:5(first)
  2296000    0.402    0.000    0.402    0.000 s.py:7(<genexpr>)
        1    0.000    0.000    0.000    0.000 timeit.py:143(setup)
        1    0.000    0.000    2.216    2.216 timeit.py:178(timeit)
        1    0.000    0.000    2.216    2.216 timeit.py:96(inner)
        1    0.000    0.000    0.000    0.000 {gc.disable}
        1    0.000    0.000    0.000    0.000 {gc.enable}
        1    0.000    0.000    0.000    0.000 {gc.isenabled}
        1    0.000    0.000    0.000    0.000 {globals}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      100    0.284    0.003    0.686    0.007 {method 'join' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {time.time}


None
second          2296211 function calls in 0.939 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.939    0.939 <string>:1(<module>)
      100    0.003    0.000    0.939    0.009 s.py:39(<lambda>)
      100    0.729    0.007    0.936    0.009 s.py:9(second)
        1    0.000    0.000    0.000    0.000 timeit.py:143(setup)
        1    0.000    0.000    0.939    0.939 timeit.py:178(timeit)
        1    0.000    0.000    0.939    0.939 timeit.py:96(inner)
        1    0.000    0.000    0.000    0.000 {gc.disable}
        1    0.000    0.000    0.000    0.000 {gc.enable}
        1    0.000    0.000    0.000    0.000 {gc.isenabled}
        1    0.000    0.000    0.000    0.000 {globals}
  2295900    0.165    0.000    0.165    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      100    0.042    0.000    0.042    0.000 {method 'join' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {time.time}


None
third          2296211 function calls in 8.361 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    8.361    8.361 <string>:1(<module>)
      100    8.145    0.081    8.357    0.084 s.py:16(third)
      100    0.004    0.000    8.361    0.084 s.py:39(<lambda>)
        1    0.000    0.000    0.000    0.000 timeit.py:143(setup)
        1    0.000    0.000    8.361    8.361 timeit.py:178(timeit)
        1    0.000    0.000    8.361    8.361 timeit.py:96(inner)
        1    0.000    0.000    0.000    0.000 {gc.disable}
        1    0.000    0.000    0.000    0.000 {gc.enable}
        1    0.000    0.000    0.000    0.000 {gc.isenabled}
        1    0.000    0.000    0.000    0.000 {globals}
  2295900    0.169    0.000    0.169    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      100    0.043    0.000    0.043    0.000 {method 'join' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {time.time}


None

Upvotes: 1

roippi
roippi

Reputation: 25954

Your solution is algorithmically correct (the first is O(n**2), and the second is O(n)) but you're doing some things that are going to be possible red flags to an interviewer.

The first function is basically okay. You might get bonus points for writing it like this:

''.join([c for c in string1 if c in string2])

..which does essentially the same thing.

My problem (if I'm wearing my interviewer pants) with how you've written the second function is that you use a defaultdict where you don't care at all about the count - you only care about membership. This is the classic case for when to use a set.

seen = set(string2)

''.join([c for c in string1 if c in seen])

The way I've written these functions are going to be slightly faster than what you wrote, since list comprehensions loop in native code rather than in python bytecode. They are algorithmically the same complexity.

Upvotes: 3

thefourtheye
thefourtheye

Reputation: 239483

The algorithms you have used are perfectly fine. There are few improvements you can do

  1. Since you are converting the second string to a dictionary, I would recommend using set, like this

    d = set(string2)
    
  2. Apart from that you can use list comprehension, as a filter, like this

    return "".join([char for char in string1 if char in d])
    
  3. If the order of the characters in the output doesn't matter, you can simply convert both the strings to sets and simply find set difference, like this

    return "".join(set(string1) - set(string2))
    

Upvotes: 1

loopbackbee
loopbackbee

Reputation: 23322

Your method is sound, and there is no method with time complexity less than O(N), since you obviously need to go through each character at least once.

That is not to say that there's no method that runs faster. There's no need to actually increment the numbers in the dictionary. You could, for example, use a set. You could also make further use of python's features, such as list comprehensions/generators:

def find_char_fast2(string1, string2):
    s= set(string2)
    return "".join( (x for x in string1 if x in s) )

Upvotes: 1

Related Questions