Reputation: 3862
Two files. One with broken data, the other with fixes. Broken:
ID 0
T5 rat cake
~EOR~
ID 1
T1 wrong segg
T2 wrong nacob
T4 rat tart
~EOR~
ID 3
T5 rat pudding
~EOR~
ID 4
T1 wrong sausag
T2 wrong mspa
T3 strawberry tart
~EOR~
ID 6
T5 with some rat in it
~EOR~
Fixes:
ID 1
T1 eggs
T2 bacon
~EOR~
ID 4
T1 sausage
T2 spam
T4 bereft of loif
~EOR~
~EOR~ marks the end of a record. Note that the broken file has more records than the fixes file. The fixes file contains tags (T1, T2 etc. are tags) to replace and tags to add. This code does exactly what it's supposed to do:
# foobar.py
import codecs

source = 'foo.dat'      # the fixes file
target = 'bar.dat'      # the broken file
result = 'result.dat'

with codecs.open(source, 'r', 'utf-8_sig') as s, \
     codecs.open(target, 'r', 'utf-8_sig') as t, \
     codecs.open(result, 'w', 'utf-8_sig') as u:
    sID = sT1 = sT2 = sT4 = ''
    RecordFound = False
    # get source data, record by record
    for sline in s:
        if sline.startswith('ID '):
            sID = sline
        if sline.startswith('T1 '):
            sT1 = sline
        if sline.startswith('T2 '):
            sT2 = sline
        if sline.startswith('T4 '):
            sT4 = sline
        if sline.startswith('~EOR~'):
            for tline in t:
                # copy target file lines, replacing when necessary
                if tline == sID:
                    RecordFound = True
                if tline.startswith('T1 ') and RecordFound:
                    tline = sT1
                if tline.startswith('T2 ') and RecordFound:
                    tline = sT2
                if tline.startswith('~EOR~') and RecordFound:
                    if sT4:
                        tline = sT4 + tline
                    RecordFound = False
                    u.write(tline)
                    break
                u.write(tline)
    # copy whatever is left of the target file
    for tline in t:
        u.write(tline)
I'm writing to a new file because I don't want to mess up the other two. The outer for loop finishes on the last record in the fixes file. At that point, there are still records left to write from the target file. That's what the final for loop does.
What's nagging me is that this final loop implicitly picks up where the inner for loop last broke out. It's as if it should say 'for the rest of tline in t'. On the other hand, I don't see how I could do this in fewer (or not many more) lines of code (using dicts and what have you). Should I worry at all?
Please comment.
Upvotes: 2
Views: 156
Reputation: 3862
For the sake of completeness, and just to share my enthusiasm and what I learned, below is the code that I now work with. It answers my OP, and more.
It's based in part on akaRem's approach above. A single function fills a dict. It's called twice, once for the fixes file, once for the file-to-fix.
import codecs, collections
from GetInfiles import *

sourcefile, targetfile = GetInfiles('dat')
# GetInfiles reads two input parameters from the command line,
# verifies they exist as files with the right extension,
# and then returns their names. Code not included here.
resultfile = targetfile[:-4] + '_result.dat'

def recordlist(infile):
    record = collections.OrderedDict()
    reclist = []
    with codecs.open(infile, 'r', 'utf-8_sig') as f:
        for line in f:
            try:
                key, value = line.split(' ', 1)
            except ValueError:
                key = line
                # so this line must be '~EOR~\n'.
                # All other lines must have the shape 'tag content\n'
                # so if this errors, there's something wrong with an input file
            if not key.startswith('~EOR~'):
                try:
                    record[key].append(value)
                except KeyError:
                    record[key] = [value]
            else:
                reclist.append(record)
                record = collections.OrderedDict()
    return reclist

# put files into ordered dicts
source = recordlist(sourcefile)
target = recordlist(targetfile)

# patching
for fix in source:
    for record in target:
        if fix['ID'] == record['ID']:
            record.update(fix)

# write-out
with codecs.open(resultfile, 'w', 'utf-8_sig') as f:
    for record in target:
        # items() works on both Python 2 and 3 (the original used iteritems())
        for tag, field in record.items():
            for occ in field:
                # occ still carries its trailing newline from the input file
                line = u'{} {}'.format(tag, occ)
                f.write(line)
        f.write('~EOR~\n')
It's now an ordered dict. This was not in my OP, but the files need to be cross-checked by humans, so keeping the order makes that easier. (Using OrderedDict is really easy. My first attempts at finding this functionality led me to odict, but its documentation worried me. No examples, intimidating jargon...)
Also, it now supports multiple occurrences of any given tag inside a record. This was not in my OP either, but I needed this. (That format is called 'Adlib tagged', it's cataloguing software.)
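To make the "multiple occurrences" point concrete, here is a minimal sketch of how one record can keep its tags in input order while collecting repeated tags into a list. The helper name parse_record and the repeated T2 line are made up for illustration. It also strips the trailing newline, which the code above does not do.

```python
import collections

def parse_record(lines):
    """Collect 'TAG value' lines into an ordered dict of lists (hypothetical helper)."""
    record = collections.OrderedDict()
    for line in lines:
        tag, value = line.rstrip('\n').split(' ', 1)
        # setdefault creates the list on first sight of a tag, then we append
        record.setdefault(tag, []).append(value)
    return record

# Illustrative input: T2 occurs twice in the same record
rec = parse_record(["ID 4\n", "T1 sausage\n", "T2 spam\n", "T2 more spam\n"])
tags_in_order = list(rec.keys())
```

Tags come back in the order they were first seen, and every occurrence of a repeated tag ends up in that tag's list.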
Different from akaRem's approach is the patching, using update for the target dict. I find this, as often with python, really and truly elegant. Likewise for startswith. These are two more reasons I can't resist sharing it.
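A small sketch of what update does to one record, using the ID 4 sample data from the question (values stored as lists, as in recordlist, but with newlines left out for readability). Tags that already exist keep their position and get the new value, and tags that are new are appended at the end. One thing to be aware of, as far as I can tell: update replaces the whole list for a tag, so all occurrences of a patched tag are swapped out at once.

```python
import collections

# The broken record for ID 4, as an ordered dict of lists
record = collections.OrderedDict([
    ('ID', ['4']),
    ('T1', ['wrong sausag']),
    ('T2', ['wrong mspa']),
    ('T3', ['strawberry tart']),
])

# The matching fix record
fix = collections.OrderedDict([
    ('ID', ['4']),
    ('T1', ['sausage']),
    ('T2', ['spam']),
    ('T4', ['bereft of loif']),
])

record.update(fix)            # existing tags overwritten in place, T4 appended
patched_tags = list(record.keys())
```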
I hope it's useful.
Upvotes: 0
Reputation: 7618
# building initial storage
content = {}
record = {}
order = []
current = None
with open('broken.file', 'r') as f:
    for line in f:
        # strip the newline so that '~EOR~' and the ID values compare cleanly
        items = line.rstrip('\n').split(' ', 1)
        try:
            key, value = items
        except ValueError:
            key, = items
            value = None
        if key == 'ID':
            current = value
            order.append(current)
            content[current] = record = {}
        elif key == '~EOR~':
            current = None
            record = {}
        else:
            record[key] = value

# patching
with open('patches.file', 'r') as f:
    for line in f:
        items = line.rstrip('\n').split(' ', 1)
        try:
            key, value = items
        except ValueError:
            key, = items
            value = None
        if key == 'ID':
            current = value
            record = content[current]  # updates existing records only!
            # if there is no such id -> raises
            # alternatively you may check and add them to the end of list
            # if current in content:
            #     record = content[current]
            # else:
            #     order.append(current)
            #     content[current] = record = {}
        elif key == '~EOR~':
            current = None
            record = {}
        else:
            record[key] = value
# patched!

# write-out
with open('output.file', 'w') as f:
    for current in order:
        f.write('ID ' + current + '\n')
        record = content[current]
        for key in sorted(record.keys()):
            f.write(key + ' ' + (record[key] or '') + '\n')
        f.write('~EOR~\n')  # close the record so the output keeps the input format
# job's done
questions?
Upvotes: 1
Reputation: 674
I wouldn't worry. In your example, t is a file handle and you are iterating over it. File handles in Python are their own iterators; they have state information about where they've read in the file and will keep their place as you iterate over them. You can check the Python docs for file.next() for more info.
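Here's a small sketch of that behaviour. It uses io.StringIO as a stand-in for an open file (an assumption to keep the example self-contained; a real file handle iterates the same way), with a couple of records from the question's fixes file.

```python
import io

# Stand-in for an open text file; it yields one line per iteration, like a file
handle = io.StringIO(u"ID 1\nT1 eggs\n~EOR~\nID 4\nT1 sausage\n~EOR~\n")

first_record = []
for line in handle:
    first_record.append(line)
    if line.startswith('~EOR~'):
        break  # the handle remembers how far it has read

# A second loop over the same handle resumes right after the break
rest = [line for line in handle]

# The handle is its own iterator, so iter() returns the very same object
same_object = iter(handle) is handle
```

So the second for loop really does mean "for the rest of the lines", because both loops draw from the same stateful iterator.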
See also another SO answer that talks about iterators: What does the "yield" keyword do in Python? Lots of helpful information there!
Edit: Here's another way to combine them using dictionaries. This method may be desirable if you want to do other modifications to the records before you output:
import sys

def get_records(source_lines):
    records = {}
    current_id = None
    for line in source_lines:
        if line.startswith('~EOR~'):
            continue
        # Split the line up on the first space
        tag, val = [l.rstrip() for l in line.split(' ', 1)]
        if tag == 'ID':
            current_id = val
            records[current_id] = {}
        else:
            records[current_id][tag] = val
    return records

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        broken = get_records(f)
    with open(sys.argv[2]) as f:
        fixed = get_records(f)

    # Merge the broken and fixed records
    repaired = broken
    for id in fixed.keys():
        # Python 2: items() returns lists, so they can be concatenated with +
        repaired[id] = dict(broken[id].items() + fixed[id].items())

    with open(sys.argv[3], 'w') as f:
        for id, tags in sorted(repaired.items()):
            f.write('ID {}\n'.format(id))
            for tag, val in sorted(tags.items()):
                f.write('{} {}\n'.format(tag, val))
            f.write('~EOR~\n')
The dict(broken[id].items() + fixed[id].items()) part takes advantage of the technique described in: How to merge two Python dictionaries in a single expression?
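Note that concatenating items() with + relies on Python 2, where items() returns lists. As a hedged alternative, here is a sketch of the same merge that should behave the same on Python 2 and 3. The helper name merge_record is made up for this example, and the data is the ID 4 record from the question.

```python
def merge_record(broken_record, fixed_record):
    """Return a new dict: broken tags overridden and extended by the fixes."""
    merged = dict(broken_record)   # shallow copy, leaves the input untouched
    merged.update(fixed_record)    # fixed values win; new tags are added
    return merged

broken_record = {'T1': 'wrong sausag', 'T2': 'wrong mspa', 'T3': 'strawberry tart'}
fixed_record = {'T1': 'sausage', 'T2': 'spam', 'T4': 'bereft of loif'}

repaired_record = merge_record(broken_record, fixed_record)
```

In the script above, this would replace the right-hand side of the assignment inside the for loop, i.e. repaired[id] = merge_record(broken[id], fixed[id]).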
Upvotes: 2