Reputation: 7222
How do I replace text from beginning to end by shortest to longest match
I have the data:
In [1807]: text='110011111000011010110011'
and a dictionary that looks like this:
s={'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}
and i want the output of text replaced to
'++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'
So the first 11, changes to ++--, the second match is 001, which is then --, the third match is 11 which is then replaced with ++--. The shortest match replaces first, but if not found, then the second longest match is then replaced. This has to be iteratively from the beginning of the string, or it does not work. I'm looking for an idiomatic approach to this issue.
The current code i use works, but i think there is a more idiomatic way to solve this issue. This is what i currently have:
s={'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}
peekstart=0
peekend=1
sptr='110011111000011010110011'
bytestr=b''
while peekstart+peekend <= sptrlen:
match=False
while match == False:
peek = sptr[peekstart:peekstart+peekend]
findptr = s.get(peek)
while findptr == None:
peek = sptr[peekstart:peekstart+peekend+stream]
stream=stream+1
findptr = s.get(peek)
if stream == 0:
peekstart=peekstart+peekend+stream
else:
peekstart=peekstart+peekend+stream-1
stream=0
bytestr+=s.get(peek).encode()
match = True
print(bytestr)
b'++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'
Is there a more idiomatic way to solve this problem?
ANSWERED
Using the answers from here i came up with this new version that uncompresses very nicely and deals with not needing a buffer for binaries that start with zeros:
from functools import reduce
s={'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}
def decompressHuffmanCode(a, bit):
return ('', a[1] + a[2][a[0]+bit[0]], a[2]) if (a[0]+bit[0] in a[2]) else (a[0]+bit[0], a[1], a[2])
# Added one to the front to deal with cases of the binary starting with zeroes
compressed_binary='1110011111000011010110011'
read_compressed_binary = compressed_binary[1:]
remainder,bytestr,s = reduce(decompressHuffmanCode, read_compressed_binary, ('', '', s))
print(bytestr)
Thanks to everyone for their replies, they lead to this shortened version
Upvotes: 1
Views: 194
Reputation: 746
This is a common eat-function. You eat from left until you reach something that matches in your dictionary.
text='110011111000011010110011'
s={'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}
string = ''
i = 0
for j in range(len(text) + 1):
symbol = s.get(text[i:j])
if symbol is not None:
string += symbol
i = j
print(string)
Out: ++----++--++--+-++++++-+-+-----++-+---+-+---+-++--
Upvotes: 3
Reputation: 129
This is @Artogs's "alternative 2" reduce method but with a little code golf
def worker(aggregate, bit):
prefix, out = aggregate
key = prefix + bit
return ('', out + s[key]) if (key in s) else (key, out)
remainder,string = reduce(worker, '110011111000011010110011', ('', ''))
print string
Upvotes: 1
Reputation: 1142
I can think of two ways to do this, but I'm not sure that alternative 1 is correct in the left-to-right sense. It would be best to write a couple of tests to check that it behaves correctly.
(and now you have two problems? :P)
import re
def transform(input_string, substitutions):
pattern = "|".join(sorted(substitutions.keys(), key = lambda x: len(x)))
def replacement(match):
return substitutions[match.group(0)]
subs = 1
while subs > 0:
input_string, subs = re.subn(pattern, replacement, input_string)
return input_string
s = {
'11': '++--',
'01': '-+-+-',
'000': '+-++++++',
'001': '--',
'100': '--+-',
'101': '----++-+--'
}
print(transform('110011111000011010110011', s))
This concatenates all keys, sorted by length, with an or
(|
), so it should replace the shortest first. Then it just replaces until it can find any more. It works for your case, but I'm unsure how it handles more complex cases.
from functools import reduce
s = {
'11': '++--',
'01': '-+-+-',
'000': '+-++++++',
'001': '--',
'100': '--+-',
'101': '----++-+--'
}
def partial_transform(accumulator,bit):
blob, result = accumulator
next_blob = blob + bit
if next_blob in s.keys():
return ('', result + s[next_blob])
return (next_blob, result)
print(reduce(partial_transform, '110011111000011010110011', ('','')))
This approach is similar to the other answers as it 'eats' from the left until it finds a match, adds that to the result then continues until the input ends.
Upvotes: 2
Reputation: 3082
text='110011111000011010110011'
s={'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}
short_keys = [k for k in s.keys() if len(k) == 2]
long_keys = [k for k in s.keys() if len(k) == 3]
i=0
result = ""
while i <= len(text) -2:
if text[i:i+2] in short_keys:
result += s[text[i:i+2]]
i+=2
else:
if i == len(text)-3:
break
result += s[text[i:i+3]]
i+=3
Note that this raises a key error if no fitting key is found in the dict.
Upvotes: 0