Reputation: 7222

How do I replace text from beginning to end by shortest to longest match

I have the data:

In [1807]: text='110011111000011010110011'

and a dictionary that looks like this:

s={'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}

and i want the output of text replaced to

'++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'

So the first 11, changes to ++--, the second match is 001, which is then --, the third match is 11 which is then replaced with ++--. The shortest match replaces first, but if not found, then the second longest match is then replaced. This has to be iteratively from the beginning of the string, or it does not work. I'm looking for an idiomatic approach to this issue.

The current code i use works, but i think there is a more idiomatic way to solve this issue. This is what i currently have:


s={'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}

peekstart=0
peekend=1
sptr='110011111000011010110011'
bytestr=b''
while peekstart+peekend <= sptrlen:
  match=False
  while match == False:
    peek = sptr[peekstart:peekstart+peekend] 
    findptr = s.get(peek)
    while findptr == None:
       peek = sptr[peekstart:peekstart+peekend+stream]
       stream=stream+1
       findptr = s.get(peek)
    if stream == 0:
      peekstart=peekstart+peekend+stream
    else:
      peekstart=peekstart+peekend+stream-1
    stream=0
    bytestr+=s.get(peek).encode()
    match = True
print(bytestr)
b'++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'

Is there a more idiomatic way to solve this problem?

ANSWERED

Using the answers from here i came up with this new version that uncompresses very nicely and deals with not needing a buffer for binaries that start with zeros:

from functools import reduce
s={'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}

def decompressHuffmanCode(a, bit):
    return ('', a[1] + a[2][a[0]+bit[0]], a[2]) if (a[0]+bit[0] in a[2]) else (a[0]+bit[0], a[1], a[2])

# Added one to the front to deal with cases of the binary starting with zeroes
compressed_binary='1110011111000011010110011'
read_compressed_binary = compressed_binary[1:]
remainder,bytestr,s = reduce(decompressHuffmanCode, read_compressed_binary, ('', '', s))
print(bytestr)

Thanks to everyone for their replies, they lead to this shortened version

Upvotes: 1

Answers (4)

RoyM

Reputation: 746

This is a common eat-function. You eat from left until you reach something that matches in your dictionary.

text='110011111000011010110011'
s={'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}

string = ''
i = 0
for j in range(len(text) + 1):
    symbol = s.get(text[i:j])
    if symbol is not None:
        string += symbol
        i = j

print(string)

Out: ++----++--++--+-++++++-+-+-----++-+---+-+---+-++--

Upvotes: 3

John Watson

Reputation: 129

This is @Artogs's "alternative 2" reduce method but with a little code golf

def worker(aggregate, bit):
    prefix, out = aggregate
    key = prefix + bit
    return ('', out + s[key]) if (key in s) else (key, out)

remainder,string = reduce(worker, '110011111000011010110011', ('', ''))
print string

Upvotes: 1

Artog

Reputation: 1142

I can think of two ways to do this, but I'm not sure that alternative 1 is correct in the left-to-right sense. It would be best to write a couple of tests to check that it behaves correctly.

Alternative 1: Using regex

^{(and now you have two problems? :P)}

import re
def transform(input_string, substitutions):

    pattern = "|".join(sorted(substitutions.keys(), key = lambda x: len(x)))
    def replacement(match):
        return substitutions[match.group(0)]

    subs = 1
    while subs > 0:
        input_string, subs = re.subn(pattern, replacement, input_string)
    return input_string


s = {
    '11': '++--',
    '01': '-+-+-', 
    '000': '+-++++++', 
    '001': '--', 
    '100': '--+-', 
    '101': '----++-+--'
}

print(transform('110011111000011010110011', s))

This concatenates all keys, sorted by length, with an or (|), so it should replace the shortest first. Then it just replaces until it can find any more. It works for your case, but I'm unsure how it handles more complex cases.

Alternative 2: Using functools.reduce

from functools import reduce

s = {
    '11': '++--',
    '01': '-+-+-', 
    '000': '+-++++++', 
    '001': '--', 
    '100': '--+-', 
    '101': '----++-+--'
}

def partial_transform(accumulator,bit):

    blob, result = accumulator
    next_blob = blob + bit

    if next_blob in s.keys():
        return ('', result + s[next_blob])

    return (next_blob, result)


print(reduce(partial_transform, '110011111000011010110011', ('','')))

This approach is similar to the other answers as it 'eats' from the left until it finds a match, adds that to the result then continues until the input ends.

Upvotes: 2

Florian H

Reputation: 3082

text='110011111000011010110011'  
s={'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}
short_keys = [k for k in s.keys() if len(k) == 2]
long_keys = [k for k in s.keys() if len(k) == 3]

i=0
result = ""
while i <= len(text) -2:
    if text[i:i+2] in short_keys:
        result += s[text[i:i+2]]
        i+=2
    else:
        if i == len(text)-3:
            break
        result += s[text[i:i+3]]
        i+=3

Note that this raises a key error if no fitting key is found in the dict.

Upvotes: 0

How do I replace text from beginning to end by shortest to longest match

Answers (4)

Alternative 1: Using regex

Alternative 2: Using functools.reduce

Related Questions