Reputation: 29987
I have a file from which I need to extract one piece of data, delimited by fixed patterns that may span multiple lines:
some data ... [my opening pattern
is here
and can be multiline] the data
I want to extract [my ending
pattern which can be
multiline as well] ... more data
These patterns are fixed in the sense that the content is always the same, except that it can include new lines between words.
The solution would be simple if I had the assurance that my patterns would be predictably formatted, but I do not.
Is there a way to match such "patterns" to a stream?
There is a question which is almost a duplicate, and its answers point towards buffering the input. The difference in my case is that I know the exact strings in the pattern, except that the words may also be separated by a newline (so there is no need for \w* kind of matches).
Upvotes: 2
Views: 1408
Reputation: 4504
Are you looking for this?
>>> import re
>>> data = """
... some data ... [my opening pattern
... is here
... and can be multiline] the data
... I want to extract [my ending
... pattern which can be
... multiline as well] ... more data
... """
>>> re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', data)
['the data \nI want to extract']
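Since the question says the delimiter wording is known exactly, a stricter pattern can be built from the words themselves instead of matching any bracketed text. The following is only a sketch under that assumption: `phrase_to_regex` is a hypothetical helper name (not part of `re`), and it assumes the words are separated by any run of whitespace, including newlines.

```python
import re

def phrase_to_regex(phrase):
    """Turn a fixed phrase into a regex that tolerates any whitespace
    (spaces or newlines) between its words."""
    return r'\s+'.join(re.escape(word) for word in phrase.split())

# The two fixed delimiters from the question, written on one line each.
opening = phrase_to_regex('[my opening pattern is here and can be multiline]')
closing = phrase_to_regex('[my ending pattern which can be multiline as well]')

# Non-greedy capture of everything between the two delimiters;
# DOTALL lets '.' cross line breaks.
extractor = re.compile(opening + r'\s*(.*?)\s*' + closing, re.DOTALL)

data = """
some data ... [my opening pattern
is here
and can be multiline] the data
I want to extract [my ending
pattern which can be
multiline as well] ... more data
"""

match = extractor.search(data)
if match:
    print(repr(match.group(1)))
```

Escaping each word with `re.escape` keeps the literal brackets from being read as regex syntax, so only the whitespace between words is flexible.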
UPDATE: To read a large file in chunks, I suggest the following approach:
## The following was modified based on ChrisA's code in
## http://www.gossamer-threads.com/lists/python/python/1242366.
## Titled " How to read from a file to an arbitrary delimiter efficiently?"
import re

class ChunkIter:
    def __init__(self, f, delim):
        """ f: file object
        delim: regex pattern"""
        self.f = f
        self.delim = re.compile(delim)
        self.buffer = ''
        self.part = ''  # the string to return

    def read_to_delim(self):
        """Return characters up to the last delim, or None if at EOF"""
        while "delimiter not found":  # non-empty string is always true
            b = self.f.read(256)
            if not b:  # if EOF
                self.part = None
                break
            # Continue reading to buffer
            self.buffer += b
            # Try regex split the buffer string
            parts = self.delim.split(self.buffer)
            # If pattern is found
            if parts[:-1]:
                # Retrieve the string up to the last delim
                self.part = ''.join(parts[:-1])
                # Reset buffer string
                self.buffer = parts[-1]
                break
        return self.part

if __name__ == '__main__':
    with open('input.txt', 'r') as f:
        chunk = ChunkIter(f, r'(\[[^]]*\]\s+(?:[^[]+)\s+\[[^]]+\])')
        while chunk.read_to_delim():
            print(re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', chunk.part))
    print('job done.')
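The same buffering idea can also be sketched as a generator, which makes it easy to try on an in-memory stream before pointing it at a real file. This is an illustrative variant, not the code referenced above: `iter_extracts`, `PATTERN` and `chunk_size` are hypothetical names, and `io.StringIO` stands in for the file. Note that the buffer keeps growing until a complete match is found, so this assumes the delimiters do occur reasonably often.

```python
import io
import re

PATTERN = r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]'

def iter_extracts(f, pattern, chunk_size=256):
    """Yield group(1) of every match of `pattern` found in file object `f`,
    reading `chunk_size` characters at a time."""
    regex = re.compile(pattern)
    buffer = ''
    while True:
        block = f.read(chunk_size)
        if not block:  # EOF: no further text can complete a match
            break
        buffer += block
        consumed = 0
        for m in regex.finditer(buffer):
            yield m.group(1)
            consumed = m.end()
        # Keep only the text after the last complete match.
        buffer = buffer[consumed:]

sample = io.StringIO(
    "some data ... [my opening pattern\n"
    "is here\n"
    "and can be multiline] the data\n"
    "I want to extract [my ending\n"
    "pattern which can be\n"
    "multiline as well] ... more data\n"
)

# A deliberately small chunk size forces the match to span several reads.
results = list(iter_extracts(sample, PATTERN, chunk_size=16))
print(results)
```

Because a match is only reported once its closing `]` has been read, a delimiter split across two reads is still found on a later pass over the buffer.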
Upvotes: 2