Reputation: 127
I have a structured text file containing a number of multi-line records. Each record should have a unique key field. I need to read through a series of these files, find the non-unique key fields, and replace each key value with a unique value.
My script is identifying all the fields which need replacing. I store these fields in a dictionary where the key is the non-unique field and the value is a list of unique values.
Eg:
{
"1111111111" : ["1234566363", "5533356775", "6443458343"]
}
What I would like to do is read through each file just once, finding instances of "1111111111" (the dict key) and replacing the first match with the first list value, the second match with the second list value, and so on.
I am trying to use a regular expression, but I am not sure how to construct a suitable RE without looping through the file multiple times.
This is my current code:
def multireplace(Text, Vars):
    dictSorted = sorted(Vars, key=len, reverse=True)
    regEx = re.compile('|'.join(map(re.escape, dictSorted)))
    return regEx.sub(lambda match: Vars[match.group(0)], Text)

text = multireplace(text, find_replace_dict)
It works fine when each value is a single string, but it fails when the value is a list:
return regEx.sub(lambda match: Vars[match.group(0)], Text , 1)
TypeError: sequence item 1: expected str instance, list found
Is it possible to alter the function without looping through the file multiple times?
Upvotes: 1
Views: 127
Reputation: 60143
Take a look and read through the comments. Let me know if anything doesn't make sense:
import re

def replace(text, replacements):
    # Make a copy so we don't destroy the original.
    replacements = replacements.copy()
    # This is essentially what you had already.
    regex = re.compile("|".join(map(re.escape, replacements.keys())))
    # In our lambda, we pop the first element from the list. This way,
    # each time we're called with the same key, we get the next replacement.
    return regex.sub(lambda m: replacements[m.group(0)].pop(0), text)
print(replace("A A B B A B", {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]}))
# Output:
# A1 A2 B1 B2 A3 B3
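One caveat worth noting: `.copy()` only makes a shallow copy, so the lists inside the dictionary are still consumed by `pop(0)` and the caller's data is mutated. If you need the original lists intact after the call, a deep copy avoids that. A minimal sketch of that variant:

```python
import copy
import re

def replace(text, replacements):
    # Deep-copy so the caller's lists are not consumed by pop(0).
    replacements = copy.deepcopy(replacements)
    regex = re.compile("|".join(map(re.escape, replacements.keys())))
    return regex.sub(lambda m: replacements[m.group(0)].pop(0), text)

d = {"A": ["A1", "A2"]}
print(replace("A A", d))  # A1 A2
print(d)                  # {'A': ['A1', 'A2']} -- unchanged
```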
UPDATE
To help with the issue in the comments below, try this version, which will tell you exactly which string ran out of replacements:
import re

def replace(text, replacements):
    # Use a named function so we can do a little more than the lambda.
    def make_replacement(match):
        try:
            return replacements[match.group(0)].pop(0)
        except IndexError:
            # Print out debug info about what happened.
            print("Ran out of replacements for {}".format(match.group(0)))
            # Re-raise so the process still exits.
            raise
    # Make a copy so we don't destroy the original.
    replacements = replacements.copy()
    # This is essentially what you had already.
    regex = re.compile("|".join(map(re.escape, replacements.keys())))
    # make_replacement pops the first element from the matched key's list, so
    # each time the same key matches, we get the next replacement.
    return regex.sub(make_replacement, text)

print(replace("A A B B A B A", {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]}))
# Output:
# Ran out of replacements for A
# (followed by the re-raised IndexError traceback -- the input has four
# A's but only three replacements)
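To tie this back to your use case of processing a series of files in a single pass each: because the copy inside `replace()` is shallow, the pops persist across calls, so the replacement values stay unique across the whole series. A sketch (the filenames and sample contents here are placeholders):

```python
import re

def replace(text, replacements):
    replacements = replacements.copy()
    regex = re.compile("|".join(map(re.escape, replacements.keys())))
    return regex.sub(lambda m: replacements[m.group(0)].pop(0), text)

find_replace_dict = {"1111111111": ["1234566363", "5533356775", "6443458343"]}

paths = ["records1.txt", "records2.txt"]  # placeholder filenames
# Create sample input files for the demo.
open(paths[0], "w").write("key: 1111111111\nkey: 1111111111\n")
open(paths[1], "w").write("key: 1111111111\n")

# The shallow copy means pop(0) consumes the shared lists, so
# replacements remain unique across all files in the series.
for path in paths:
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:  # one read and one write per file
        f.write(replace(text, find_replace_dict))

print(open(paths[0]).read())  # key: 1234566363 / key: 5533356775
print(open(paths[1]).read())  # key: 6443458343
```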
Upvotes: 1