Reputation: 881
I'm trying to capture all the remaining text in a file after three hyphens at the start of a line (---).
Example:
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Everything after the first set of three hyphens should be captured. The closest I've gotten is the regex [^(---)]+$, which only partly works. It captures everything after the hyphens, but if the user places any hyphens after that point, it captures only what follows the last hyphen instead.
I am using this with Python to capture the text.
If anyone can help me sort out this regex problem I'd appreciate it.
Upvotes: 1
Views: 380
Reputation: 308
with open(myfile) as f:
    s = f.read().split('\n\n---\n\n', 1)
print(s[0])  # first part
print(s[1])  # second part after the dashes
This should work for your example. The second parameter to split specifies how many times to split the string.
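As a rough illustration of the split-once idea on an in-memory string, here is a minimal sketch. It assumes the separator is a line containing exactly ---, and the helper name after_first_separator and the sample text are made up for this example.

```python
def after_first_separator(text, sep="\n---\n"):
    """Return everything after the first occurrence of sep, or None if sep is absent."""
    # maxsplit=1 stops after the first separator, so later "---" lines stay in the result.
    parts = text.split(sep, 1)
    return parts[1] if len(parts) == 2 else None


sample = "header\n---\nbody line\n---\nmore body\n"
print(after_first_separator(sample))
```

One limitation of this sketch: because the separator includes the preceding newline, a --- on the very first line of the text would not be found.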
Upvotes: 1
Reputation: 1302
Sorry for not directly answering your question, but I wonder if regular expressions are overcomplicating the problem? You could do something like this:
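with open('myfile', 'r') as f:
    # Skip lines up to and including the first one that starts with "---".
    for line in f:
        if line.startswith("---"):
            break
    # Whatever is left in the file is the content to capture.
    text = f.readlines()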
Or, am I missing something?
I tend to find that regular expressions are difficult enough to maintain that if you don't need their unique capabilities for a given purpose it'll be cleaner and more readable to avoid using them entirely.
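To show how the line-by-line approach treats later hyphens, here is a small self-contained sketch that uses io.StringIO as a stand-in for a real file. The function name read_after_marker and the sample data are hypothetical, and returning None when no marker is found is just one possible choice.

```python
import io


def read_after_marker(lines, marker="---"):
    """Skip lines up to and including the first one starting with marker; return the rest."""
    it = iter(lines)
    for line in it:
        if line.startswith(marker):
            # The iterator now sits just past the marker line; join what remains.
            return "".join(it)
    return None  # no marker line found


fake_file = io.StringIO("above\n---\ncontent\n--- later\nmore\n")
captured = read_after_marker(fake_file)
print(captured)
```

Only the first marker line is consumed, so the later "--- later" line ends up in the captured text.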
Upvotes: 1
Reputation: 880229
pat = re.compile(r'(?ms)^---(.*)\Z')
The (?ms) adds the MULTILINE and DOTALL flags.
The MULTILINE flag makes ^ match the beginning of lines (not just the beginning of the string). We need this because the --- occurs at the beginning of a line, but not necessarily the beginning of the string.
The DOTALL flag makes . match any character, including newlines. We need this so that (.*) can match more than one line.
\Z matches the end of the string (as opposed to the end of a line).
For example,
import re
text = '''\
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
'''
pat = re.compile(r'(?ms)^---(.*)\Z')
print(re.search(pat, text).group(1))
prints
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
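As a further check, here is a short sketch of the same pattern applied to text that contains a second --- line, and to text where the hyphens are not at the start of a line. The sample strings are made up for illustration.

```python
import re

pat = re.compile(r'(?ms)^---(.*)\Z')

# A second "---" line after the first one is simply part of the captured text.
text = "intro\n---\nfirst\n---\nsecond\n"
match = pat.search(text)
captured = match.group(1) if match else None
print(repr(captured))

# Hyphens in the middle of a line do not count, so there is no match here.
print(pat.search("a --- b"))
```

The captured group starts with the newline that ends the --- line, so something like captured.lstrip('\n') may be wanted if that leading newline matters.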
Note that when you define a regex character class with brackets, [...], the stuff inside the brackets is (in general, except for hyphenated ranges like a-z) interpreted as single characters. They are not patterns. So [---] is no different from [-]. In fact, [---] is the range of characters from - to -, inclusive.
The parentheses inside the character class are interpreted as literal parentheses too, not grouping delimiters. So [(---)] behaves much like [-()], the character class including the hyphen and left and right parentheses. (Strictly speaking, the leading (-- is read as the range from ( to -, which also takes in *, + and the comma.)
Thus [^(---)]+ matches a run of characters that contains no hyphen or parenthesis:
In [23]: re.search('[^(---)]+', 'foo - bar').group()
Out[23]: 'foo '
In [24]: re.search('[^(---)]+', 'foo ( bar').group()
Out[24]: 'foo '
You can see where this is going, and why it does not work for your problem.
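To make the failure concrete, here is a small sketch that runs the regex from the question, with its trailing $, on made-up sample text. Recent Python versions may emit a FutureWarning about the doubled hyphen inside the brackets, so the sketch silences that warning; this is a precaution, not something the original pattern requires.

```python
import re
import warnings

text = "above\n---\ncontent with a - hyphen\nlast line\n"

with warnings.catch_warnings():
    # Assumption: ignore the possible "set difference" FutureWarning for "--" in a class.
    warnings.simplefilter("ignore", FutureWarning)
    m = re.search(r'[^(---)]+$', text)

# The match must reach the end of the string without crossing a hyphen,
# so it can only start after the last hyphen in the text.
tail = m.group()
print(repr(tail))
```

Everything between the --- line and the last hyphen is lost, which matches the behaviour described in the question.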
Upvotes: 1