Reputation: 43
I have a large number of text files (>1000), all in the same format.
The part of the file I'm interested in looks something like:
# event 9
num: 1
length: 0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456
# event 10
num: 1
length: 0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456
I need only the second value from the second occurrence of a given variable, in this case length. Is there a Pythonic way of doing this (i.e. not sed)?
My code looks like:
with open(wave_cat,'r') as catID:
    for i, cat_line in enumerate(catID):
        if not len(cat_line.strip()) == 0:
            line = cat_line.split()
            #replen = re.sub('length:','length0:','length:')
            if line[0] == '#' and line[1] == 'event':
                num = long(line[2])
            elif line[0] == 'length:':
                Length = float(line[2])
Upvotes: 0
Views: 85
Reputation: 104111
If you can read the entire file into memory, just do a regex against the file contents:
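import re, glob
pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
filenames = glob.glob('*.txt')  # assumed pattern; substitute your own list of files
for fn in filenames:
    with open(fn) as f:
        try:
            # findall returns one capture per event block; [1] selects the second event
            nm = pat.findall(f.read())[1]
        except IndexError:
            nm = ''
        print(nm)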
For larger files, use mmap:
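import re, mmap, glob
nth = 1  # zero-based index of the event to report
# bytes pattern: Python 3 requires one when searching an mmap
pat = re.compile(br'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
for fn in glob.glob('*.txt'):  # assumed pattern; substitute your own list of files
    with open(fn, 'rb') as f:
        # length 0 maps the whole file; ACCESS_READ avoids needing write permission
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for i, m in enumerate(pat.finditer(mm)):
            if i == nth:
                print(m.group(1).decode())
                break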
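To see what the pattern captures, here is a small sketch that runs it against the sample from the question, held in a string rather than read from a file. It assumes Python 3; `second_lengths` and `SAMPLE` are names made up for this illustration.

```python
import re

# Same pattern as above: for each "# event" block, capture the second
# number on the second "length:" line.
PAT = re.compile(
    r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)',
    re.S | re.M,
)

# The sample data from the question.
SAMPLE = """\
# event 9
num: 1
length: 0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456
# event 10
num: 1
length: 0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456
"""

def second_lengths(text):
    """Return the captured value for every event block, as floats."""
    return [float(value) for value in PAT.findall(text)]

print(second_lengths(SAMPLE))  # [176.123456, 1201.123456]
```

Note that the lazy `.*?` with `re.S` can run across event boundaries, so an event that has only one length line would likely be paired with a length line from the following event.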
Upvotes: 1
Reputation: 19830
You're on the right track. It'll probably be a bit faster to defer the splitting until you actually need it. Also, if you're scanning lots of files and only want the second length entry, breaking out of the loop once you've seen it will save a lot of time.
length_seen = 0
elements = []
with open(wave_cat,'r') as catID:
    for line in catID:
        line = line.strip()
        if not line:
            continue
        if line.startswith('# event'):
            element = {'num': int(line.split()[2])}
            elements.append(element)
            length_seen = 0
        elif line.startswith('length:'):
            length_seen += 1
            if length_seen == 2:
                element['length'] = float(line.split()[2])
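The loop above collects every event. If only one value per file is needed, the early exit could look roughly like the sketch below. It assumes Python 3 and that the value wanted is the one from the first event; `first_second_length` is a name invented for this example.

```python
import io

def first_second_length(lines, key='length:'):
    """Return the 2nd value on the 2nd `key` line of the first event, or None.

    `lines` is any iterable of text lines, for example an open file.
    """
    seen = 0
    in_event = False
    for line in lines:
        line = line.strip()
        if line.startswith('# event'):
            if in_event:
                return None  # next event began before a second `key` line
            in_event = True
        elif in_event and line.startswith(key):
            seen += 1
            if seen == 2:
                # Split only now, and stop reading the rest of the input.
                return float(line.split()[2])
    return None

sample = io.StringIO(
    "# event 9\nnum: 1\nlength: 0.000000\n"
    "otherstuff: 19.9 18.8 17.7\nlength: 0.000000 176.123456\n"
    "# event 10\nnum: 1\nlength: 0.000000\n"
)
print(first_second_length(sample))  # 176.123456
```

Because the function returns as soon as the value is found, the remainder of each file is never read, which is where the time saving over >1000 files would come from.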
Upvotes: 0
Reputation: 13596
Use a counter:
with open(wave_cat,'r') as catID:
    ct = 0
    for i, cat_line in enumerate(catID):
        if not len(cat_line.strip()) == 0:
            line = cat_line.split()
            #replen = re.sub('length:','length0:','length:')
            if line[0] == '#' and line[1] == 'event':
                num = long(line[2])
            elif line[0] == 'length:':
                ct += 1
                if ct == 2:
                    Length = float(line[2])
                    ct = 0
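The same counter idea can be wrapped in a generator so it works for any variable name and any occurrence. This is only a sketch, assuming Python 3; `nth_values` and its parameters are hypothetical, and unlike the code above it resets the counter at each event header rather than after the second match.

```python
def nth_values(lines, key='length:', occurrence=2, index=2):
    """Yield (event_number, value) for each '# event N' block.

    value is token `index` of the `occurrence`-th line whose first
    token is `key`.
    """
    event = None
    count = 0
    for raw in lines:
        tokens = raw.split()
        if not tokens:
            continue  # skip blank lines
        if tokens[0] == '#' and len(tokens) >= 3 and tokens[1] == 'event':
            event = int(tokens[2])
            count = 0  # new event: restart the counter
        elif tokens[0] == key and event is not None:
            count += 1
            # Guard against short lines such as "length: 0.000000".
            if count == occurrence and len(tokens) > index:
                yield event, float(tokens[index])

sample = """\
# event 9
num: 1
length: 0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456
# event 10
num: 1
length: 0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456
"""
print(list(nth_values(sample.splitlines())))
# [(9, 176.123456), (10, 1201.123456)]
```

Passing an open file instead of `sample.splitlines()` should work the same way, since the function only iterates over lines.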
Upvotes: 0