WoJ

Reputation: 29987

How to read a file and extract data between multiline patterns?

I have a file from which I need to extract one piece of data, delimited by fixed patterns that may span multiple lines:

some data ... [my opening pattern
is here
and can be multiline] the data 
I want to extract [my ending
pattern which can be
multiline as well] ... more data

These patterns are fixed in the sense that the content is always the same, except that it can include new lines between words.

The solution would be simple if I had the assurance that my patterns were predictably formatted, but I do not.

Is there a way to match such "patterns" to a stream?

There is a nearly duplicate question whose answers point towards buffering the input. The difference in my case is that I know the exact strings in the pattern, except that the words may also be separated by a newline (so there is no need for \w* kinds of matches).

Upvotes: 2

Views: 1408

Answers (1)

Quinn

Reputation: 4504

Are you looking for this?

>>> import re
>>> data = """
... some data ... [my opening pattern
... is here
... and can be multiline] the data
... I want to extract [my ending
... pattern which can be
... multiline as well] ... more data
... """
>>> re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', data)
['the data \nI want to extract']
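
Because the question says the delimiters are fixed phrases whose words may be separated by newlines, the pattern could also be built from the phrases themselves instead of relying on the bracket characters. A minimal sketch, assuming the whole text fits in memory; the helper names `flexible` and `extract_between` are hypothetical and not part of the original answer:

```python
import re

def flexible(phrase):
    """Turn a fixed phrase into a regex whose words may be separated by any whitespace."""
    return r'\s+'.join(re.escape(word) for word in phrase.split())

def extract_between(text, opening, closing):
    """Return the stripped text between the first opening phrase and the next closing phrase, or None."""
    pattern = flexible(opening) + r'(.*?)' + flexible(closing)
    m = re.search(pattern, text, re.DOTALL)  # DOTALL lets '.' cross newlines
    return m.group(1).strip() if m else None

data = """some data ... [my opening pattern
is here
and can be multiline] the data
I want to extract [my ending
pattern which can be
multiline as well] ... more data"""

result = extract_between(
    data,
    "[my opening pattern is here and can be multiline]",
    "[my ending pattern which can be multiline as well]",
)
print(repr(result))
```

Each word is escaped with `re.escape`, so characters such as `[` and `]` in the phrases are matched literally, and only the gaps between words are allowed to vary.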

UPDATE: To read a large file in chunks, I suggest the following approach:

## The following was modified based on ChrisA's code in
## http://www.gossamer-threads.com/lists/python/python/1242366,
## titled "How to read from a file to an arbitrary delimiter efficiently?"
import re

class ChunkIter:
    def __init__(self, f, delim):
        """ f: file object
        delim: regex pattern"""        
        self.f = f
        self.delim = re.compile(delim)
        self.buffer = ''
        self.part = '' # the string to return

    def read_to_delim(self):
        """Return characters up to the last delim, or None if at EOF"""

        while True:  # loop until a delimiter is found or EOF is reached
            b = self.f.read(256)
            if not b: # if EOF
                self.part = None
                break
            # Continue reading to buffer
            self.buffer += b
            # Try regex split the buffer string    
            parts = self.delim.split(self.buffer)
            # If pattern is found
            if parts[:-1]:
                # Retrieve the string up to the last delim
                self.part = ''.join(parts[:-1])
                # Reset buffer string
                self.buffer = parts[-1]
                break   

        return self.part

if __name__ == '__main__':
    with open('input.txt', 'r') as f:
        # Raw strings keep the regex backslashes intact.
        chunk = ChunkIter(f, r'(\[[^]]*\]\s+(?:[^[]+)\s+\[[^]]+\])')
        while chunk.read_to_delim():
            print(re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', chunk.part))

    print('job done.')
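
The same buffering idea can also be written as a generator that yields only the text between the two delimiters. This is a rough sketch rather than a drop-in replacement: the function name `iter_between`, the `chunk_size` parameter and the short sample delimiters are assumptions made for illustration, and it assumes the delimiters are fixed phrases written as regexes with `\s+` between words. As with the class above, the buffer keeps growing until a complete match is found.

```python
import io
import re

def iter_between(f, opening, closing, chunk_size=256):
    """Yield each piece of text found between the opening and closing regexes in file object f."""
    pattern = re.compile(opening + r'(.*?)' + closing, re.DOTALL)
    buffer = ''
    while True:
        block = f.read(chunk_size)
        buffer += block
        pos = 0
        for m in pattern.finditer(buffer):
            yield m.group(1)
            pos = m.end()
        # Keep only the unmatched tail; a delimiter may still be incomplete in it.
        buffer = buffer[pos:]
        if not block:  # EOF: nothing more to read
            break

# Hypothetical sample input, read in tiny chunks to force delimiters to straddle reads.
sample = io.StringIO(
    "junk [my\nopening] first [my ending] more [my opening] second [my\nending] tail"
)
found = list(iter_between(sample, r'\[my\s+opening\]', r'\[my\s+ending\]', chunk_size=8))
print(found)
```

Because a match is only accepted once both delimiters are completely present in the buffer, a delimiter split across two reads is simply picked up on a later iteration.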

Upvotes: 2
