scootie

Reputation: 43

How to recover only the second instance of a string in a text file?

I have a large number of text files (>1000), all in the same format.

The part of the file I'm interested in looks something like:

# event 9
num:     1
length:      0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456

# event 10
num:     1
length:      0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456

I need only the second value on the second occurrence of a given variable, in this case length. Is there a Pythonic way of doing this (i.e. not sed)?

My code looks like:

with open(wave_cat,'r') as catID:
    for i, cat_line in enumerate(catID):
        if not len(cat_line.strip()) == 0:
            line    = cat_line.split()
            #replen = re.sub('length:','length0:','length:')
            if line[0] == '#' and line[1] == 'event':
                num = long(line[2])
            elif line[0] == 'length:':
                Length = float(line[2])

Upvotes: 0

Views: 85

Answers (3)

dawg

Reputation: 104111

If you can read the entire file into memory, just do a regex against the file contents:

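import glob, re

pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
for fn in glob.glob('*.txt'):  # placeholder pattern; substitute your own file list
    with open(fn) as f:
        try:
            nm = pat.findall(f.read())[1]  # second match in the file
        except IndexError:
            nm = ''  # fewer than two matches
        print(nm)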
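
To make it concrete what the pattern captures, here is a small sketch that runs it against the sample from the question. The `SAMPLE` name is only for illustration. Note that each match covers one whole event, so index 1 of the result is the value from the second event, not a second value within the first one.

```python
import re

# Sample text copied from the question (two events).
SAMPLE = """\
# event 9
num:     1
length:      0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456

# event 10
num:     1
length:      0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456
"""

# From an event header, skip past the first "length:" line and capture
# the second number on the second "length:" line.
pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)

matches = pat.findall(SAMPLE)
print(matches)     # ['176.123456', '1201.123456']
print(matches[1])  # 1201.123456
```

If you want one value per event rather than a single value per file, iterate over all of `matches` instead of indexing into it.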

For larger files, use mmap:

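import glob, mmap, re

nth = 1  # zero-based index of the match to print (1 = second match)
# bytes pattern, because an mmap object holds bytes
pat = re.compile(br'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
for fn in glob.glob('*.txt'):  # placeholder pattern; substitute your own file list
    with open(fn, 'rb') as f:
        # read-only mapping; mmap raises ValueError on an empty file
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for i, m in enumerate(pat.finditer(mm)):
            if i == nth:
                print(m.group(1).decode())
                break
        mm.close()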

Upvotes: 1

chthonicdaemon

Reputation: 19830

You're on the right track. It will probably be a bit faster to defer the splitting until you actually need it. Also, if you're scanning lots of files and only want the second length entry, you can save a lot of time by breaking out of the loop once you've seen it.

length_seen = 0
elements = []
with open(wave_cat,'r') as catID:
    for line in catID:
        line = line.strip()
        if not line:
            continue
        if line.startswith('# event'):
            element = {'num': int(line.split()[2])}
            elements.append(element)
            length_seen = 0
        elif line.startswith('length:'):
            length_seen += 1
            if length_seen == 2:
                element['length'] = float(line.split()[2])
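
The loop above collects one value per event and keeps reading to the end of the file. If only the first such value per file is needed (an assumption about the intent), the early exit could be sketched as below. The function name `second_length` is hypothetical, and `io.StringIO` merely stands in for an open file.

```python
import io

def second_length(lines):
    """Return the second number on the second 'length:' line, or None if absent."""
    seen = 0
    for line in lines:
        if line.lstrip().startswith('length:'):
            seen += 1
            if seen == 2:
                # Split only the one line that is actually needed, then stop reading.
                return float(line.split()[2])
    return None

# Sample text copied from the question; io.StringIO stands in for an open file.
sample = io.StringIO(u"""\
# event 9
num:     1
length:      0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456
""")
print(second_length(sample))  # 176.123456
```

With a real file you would pass the file object itself, for example inside `with open(wave_cat) as catID:`, so reading stops as soon as the value is found.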

Upvotes: 0

Tom Zych

Reputation: 13596

Use a counter:

with open(wave_cat, 'r') as catID:
    ct = 0  # number of 'length:' lines seen in the current event
    for cat_line in catID:
        line = cat_line.split()
        if not line:
            continue  # skip blank lines
        if line[:2] == ['#', 'event']:
            num = int(line[2])  # int() also covers what Python 2 needed long() for
            ct = 0  # new event: start counting again
        elif line[0] == 'length:':
            ct += 1
            if ct == 2:
                Length = float(line[2])  # second value on the second 'length:' line

Upvotes: 0
