santoku

Reputation: 3427

How to replace patterns sequentially in Python when the pattern list includes duplicates

I have a list of patterns and a list of replacements. The pattern list contains repeated elements, but each occurrence corresponds to a different replacement.

txt='132GOasmHOMEwokdslNOWsdwkGO239NOW'
pattern=['GO','HOME','NOW','GO','NOW']
REPLACEMENT=['why','nope','later','aha','genes']

The desired output would be 132whyasmnopewokdsllatersdwkaha239genes

What's the most efficient way to accomplish the sequential replacement?

Upvotes: 3

Views: 282

Answers (4)

keithpjolley

Reputation: 2263

Using a dict reduces the number of items you need to iterate over, which may be valuable for long inputs. Because a dict holds one value per key, each distinct pattern is replaced only once.

txt = '132GOasmHOMEwokdslNOWsdwkGO239NOW'
pattern = ['GO','HOME','NOW','GO','NOW']
REPLACEMENT = ['why','nope','later','aha','genes']

x = dict(zip(reversed(pattern), reversed(REPLACEMENT)))
for k in x:
  txt = txt.replace(k,x[k], 1)
print(txt)
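
A quick check of what the dict built above actually holds may help here. This is only a sketch: it assumes Python 3.7+ (insertion-ordered dicts), and the names `mapping` and `result` are introduced for illustration.

```python
pattern = ['GO', 'HOME', 'NOW', 'GO', 'NOW']
REPLACEMENT = ['why', 'nope', 'later', 'aha', 'genes']

# Later pairs overwrite earlier ones, so zipping the reversed lists keeps
# the first replacement listed for each distinct pattern.
mapping = dict(zip(reversed(pattern), reversed(REPLACEMENT)))
print(mapping)  # {'NOW': 'later', 'GO': 'why', 'HOME': 'nope'}

result = '132GOasmHOMEwokdslNOWsdwkGO239NOW'
for old, new in mapping.items():
    # count=1: only the first occurrence of each distinct pattern is replaced
    result = result.replace(old, new, 1)
print(result)  # 132whyasmnopewokdsllatersdwkGO239NOW
```

The second GO and the second NOW are left untouched, so 'aha' and 'genes' are never used: the loop performs three replacements rather than five.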

Edit: for fun I added a benchmark to back up the claim that reducing the number of items you iterate over may be valuable for long inputs. What is most efficient isn't always apparent when you are using a trivial test data set.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import timeit

# zip over both lists; replace the first remaining occurrence each time
def alpha(pattern, REPLACEMENT, txt):
  for a,b in zip(pattern,REPLACEMENT):
    txt=txt.replace(a,b,1)

# same loop, indexing REPLACEMENT via enumerate
def beta(pattern, REPLACEMENT, txt):
  for i,x in enumerate(pattern):
    txt = txt.replace(x,REPLACEMENT[i], 1)

# deduplicate the patterns with a dict, then replace each distinct pattern once
def gamma(pattern, REPLACEMENT, txt):
  x = dict(zip(reversed(pattern), reversed(REPLACEMENT)))
  for k in x:
    txt = txt.replace(k,x[k], 1)

# single regex pass; each match consumes the next replacement
def delta(pattern, REPLACEMENT, txt):
  new_d = iter(REPLACEMENT)
  new_result = re.sub('|'.join(pattern), lambda _: next(new_d), txt)

if __name__ == '__main__':

  txt = '132GOasmHOMEwokdslNOWsdwkGO239NOW'
  pattern = ['GO','HOME','NOW','GO','NOW']
  REPLACEMENT = ['why','nope','later','aha','genes']

  print("Trivial inputs:  len(pattern): {}, len(REPLACEMENT): {}, len(txt): {}".format(len(pattern), len(REPLACEMENT), len(txt)))
  print("alpha: ", timeit.timeit("alpha(pattern, REPLACEMENT, txt)", setup="from __main__ import alpha, txt, pattern, REPLACEMENT"))
  print("beta:  ", timeit.timeit("beta( pattern, REPLACEMENT, txt)", setup="from __main__ import beta,  txt, pattern, REPLACEMENT"))
  print("gamma: ", timeit.timeit("gamma(pattern, REPLACEMENT, txt)", setup="from __main__ import gamma, txt, pattern, REPLACEMENT"))
  print("delta: ", timeit.timeit("delta(pattern, REPLACEMENT, txt)", setup="from __main__ import delta, txt, pattern, REPLACEMENT"))
  print("")

  txtcopy = txt
  patterncopy = pattern.copy()
  REPLACEMENTcopy = REPLACEMENT.copy()

  for _ in range(3):
    txt = txt + txtcopy
    pattern.extend(patterncopy)
    REPLACEMENT.extend(REPLACEMENTcopy)

  print("Small inputs: len(pattern): {}, len(REPLACEMENT): {}, len(txt): {}".format(len(pattern), len(REPLACEMENT), len(txt)))
  print("alpha: ", timeit.timeit("alpha(pattern, REPLACEMENT, txt)", setup="from __main__ import alpha, txt, pattern, REPLACEMENT"))
  print("beta:  ", timeit.timeit("beta( pattern, REPLACEMENT, txt)", setup="from __main__ import beta,  txt, pattern, REPLACEMENT"))
  print("gamma: ", timeit.timeit("gamma(pattern, REPLACEMENT, txt)", setup="from __main__ import gamma, txt, pattern, REPLACEMENT"))
  print("delta: ", timeit.timeit("delta(pattern, REPLACEMENT, txt)", setup="from __main__ import delta, txt, pattern, REPLACEMENT"))
  print("")

  txt = txtcopy
  pattern = patterncopy.copy()
  REPLACEMENT = REPLACEMENTcopy.copy()

  for _ in range(300):
    txt = txt + txtcopy
    pattern.extend(patterncopy)
    REPLACEMENT.extend(REPLACEMENTcopy)

  print("Larger inputs: len(pattern): {}, len(REPLACEMENT): {}, len(txt): {}".format(len(pattern), len(REPLACEMENT), len(txt)))
  print("alpha: ", timeit.timeit("alpha(pattern, REPLACEMENT, txt)", setup="from __main__ import alpha, txt, pattern, REPLACEMENT"))
  print("beta:  ", timeit.timeit("beta(pattern, REPLACEMENT, txt)", setup="from __main__ import beta,  txt, pattern, REPLACEMENT"))
  print("gamma: ", timeit.timeit("gamma(pattern, REPLACEMENT, txt)", setup="from __main__ import gamma, txt, pattern, REPLACEMENT"))
  print("delta: ", timeit.timeit("delta(pattern, REPLACEMENT, txt)", setup="from __main__ import delta, txt, pattern, REPLACEMENT"))

Results:

Trivial inputs:  len(pattern): 5, len(REPLACEMENT): 5, len(txt): 33
alpha:  4.60048107800003
beta:   4.169088881999869
gamma:  5.7612637450001785
delta:  11.371387353000046

Small inputs: len(pattern): 20, len(REPLACEMENT): 20, len(txt): 132
alpha:  17.281149661999734
beta:   15.131949634000193
gamma:  7.339897444000144
delta:  26.50896787900001

Larger inputs: len(pattern): 1505, len(REPLACEMENT): 1505, len(txt): 9933
alpha:  18766.660852467998
beta:   17640.960064803
gamma:  64.01868645999639
delta:  901.3577002189995

So, for trivial inputs the enumerate solution is a tiny bit faster than zip and a lot faster than the regex/iter solution. When the inputs get slightly longer, the cost of not removing duplicates starts to show and my solution runs in less than half the time. On a long input with lots of duplicates, @eatmeimadanish's solution takes about 275 times as long as the version that removes duplicates. Ouch.
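
The loops benchmarked above rescan the text from the start for every pattern. If the patterns are known to occur in the text in the same order as in the list (an assumption that holds for the example data but is not stated in the question), a single left-to-right pass can replace every occurrence, duplicates included. The function name `replace_in_order` is hypothetical and this sketch has not been benchmarked.

```python
def replace_in_order(text, patterns, replacements):
    """Replace each pattern once, resuming the search where the last match ended."""
    pieces = []
    pos = 0
    for old, new in zip(patterns, replacements):
        idx = text.find(old, pos)
        if idx == -1:
            continue  # not found after the current position; skip this pattern
        pieces.append(text[pos:idx])  # untouched text before the match
        pieces.append(new)            # the replacement itself
        pos = idx + len(old)          # continue after the matched pattern
    pieces.append(text[pos:])         # whatever remains after the last match
    return ''.join(pieces)


txt = '132GOasmHOMEwokdslNOWsdwkGO239NOW'
pattern = ['GO', 'HOME', 'NOW', 'GO', 'NOW']
REPLACEMENT = ['why', 'nope', 'later', 'aha', 'genes']
print(replace_in_order(txt, pattern, REPLACEMENT))
# 132whyasmnopewokdsllatersdwkaha239genes
```

Collecting the pieces in a list and joining once avoids building a new full-length string on every replacement. Unlike `str.replace(old, new, 1)` in a loop, this never matches inside text that an earlier step inserted.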

Upvotes: 0

eatmeimadanish

Reputation: 3907

txt='132GOasmHOMEwokdslNOWsdwkGO239NOW'
pattern=['GO','HOME','NOW','GO','NOW']
REPLACEMENT=['why','nope','later','aha','genes']

for i,x in enumerate(pattern):
    txt = txt.replace(x,REPLACEMENT[i], 1)

For fun, here are the time tests, since the question asked for the most efficient way.

import re
import time

pattern=['GO','HOME','NOW','GO','NOW']
REPLACEMENT=['why','nope','later','aha','genes']

# zip
t = time.time()
for z in range(1000000):
    txt = '132GOasmHOMEwokdslNOWsdwkGO239NOW'
    for a,b in zip(pattern,REPLACEMENT):
        txt=txt.replace(a,b,1)
print(time.time() - t)

# enumerate
t = time.time()
for z in range(1000000):
    txt2 = '132GOasmHOMEwokdslNOWsdwkGO239NOW'
    for i,x in enumerate(pattern):
        txt2 = txt2.replace(x,REPLACEMENT[i], 1)
print(time.time() - t)

# dict of distinct patterns
t = time.time()
for z in range(1000000):
    txt3 = '132GOasmHOMEwokdslNOWsdwkGO239NOW'
    x = dict(zip(reversed(pattern), reversed(REPLACEMENT)))
    for k in x:
      txt3 = txt3.replace(k,x[k], 1)
print(time.time() - t)

# single regex pass; each match consumes the next replacement
t = time.time()
for z in range(1000000):
    txt = '132GOasmHOMEwokdslNOWsdwkGO239NOW'
    new_d = iter(REPLACEMENT)
    new_result = re.sub('|'.join(pattern), lambda _: next(new_d), txt)
print(time.time() - t)

Which results in:

2.57099986076
2.48500013351
3.50499987602
4.23699998856

As you can see, enumerate is slightly more efficient than zip, and the other two are not in the same ballpark.

Upvotes: 0

Timmy

Reputation: 125

I think you should try this:

import re
txt = "132GOasmHOMEwokdslNOWsdwkGO239NOW"
pattern = ['GO','HOME','NOW','GO','NOW']
REPLACEMENT = ['why','nope','later','aha','genes']
# Lists are 0-indexed; count=1 replaces only the first remaining match
txt1 = re.sub(pattern[0], REPLACEMENT[0], txt, count=1)
txt2 = re.sub(pattern[1], REPLACEMENT[1], txt1, count=1)
txt3 = re.sub(pattern[2], REPLACEMENT[2], txt2, count=1)
txt4 = re.sub(pattern[3], REPLACEMENT[3], txt3, count=1)
FINAL_TEXT = re.sub(pattern[4], REPLACEMENT[4], txt4, count=1)
print(FINAL_TEXT)

And the output:

132whyasmnopewokdsllatersdwkaha239genes

Upvotes: -1

user1763510

Reputation: 1260

You can loop through the two lists at the same time, and only replace the first instance of the pattern each time:

for a,b in zip(pattern,REPLACEMENT):
    txt=txt.replace(a,b,1)
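
For reference, the loop above can be wrapped into a self-contained, runnable sketch using the data from the question. The function name `replace_sequentially` is made up for illustration.

```python
def replace_sequentially(text, patterns, replacements):
    """Replace the first remaining occurrence of each pattern, in list order."""
    for old, new in zip(patterns, replacements):
        # The third argument limits str.replace to a single replacement
        text = text.replace(old, new, 1)
    return text


txt = '132GOasmHOMEwokdslNOWsdwkGO239NOW'
pattern = ['GO', 'HOME', 'NOW', 'GO', 'NOW']
REPLACEMENT = ['why', 'nope', 'later', 'aha', 'genes']
print(replace_sequentially(txt, pattern, REPLACEMENT))
# 132whyasmnopewokdsllatersdwkaha239genes
```

Each `replace` call searches from the start of the current string, so this relies on no replacement text containing a pattern that appears later in the list.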

Upvotes: 3
