Battleroid

Reputation: 881

Matching everything after series of hyphens

I'm trying to capture all the remaining text in a file after three hyphens at the start of a line (---).

Example:

Anything above this first set of hyphens should not be captured.

---

This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.

Everything after the first set of three hyphens should be captured. The closest I've gotten is the regex [^(---)]+$, which only partly works. It captures everything after the hyphens, but if the user places any hyphens after that point, it captures only the text after the last hyphen the user placed.

I am using this with Python to capture the text.

If anyone can help me sort out this regex problem I'd appreciate it.

Upvotes: 1

Views: 380

Answers (3)

beetea

Reputation: 308

with open(myfile) as f:
    s = f.read().split('\n\n---\n\n', 1)
print(s[0])  # first part
print(s[1])  # second part, after the dashes

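This should work for your example. The second parameter to split is the maximum number of splits to perform, so with 1 the string is cut only at the first separator. Note that the separator here assumes a blank line on each side of the hyphens, as in your example.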
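To illustrate the effect of that second argument, here is a small sketch that uses an in-memory string instead of a file. The sample text and variable names are made up for illustration only.

```python
# Hypothetical sample text: the separator is a line holding only '---'
# with a blank line on each side, as in the question's example.
sample = "header\n\n---\n\nbody line\n\n---\n\nmore body\n"

# With a limit of 1, the string is cut at the first separator only,
# so the later '---' stays inside the second part.
head, body = sample.split("\n\n---\n\n", 1)

# Without the limit, every separator splits the string.
all_parts = sample.split("\n\n---\n\n")

print(repr(head))
print(repr(body))
print(len(all_parts))
```

If the file might not contain the separator at all, the two-name unpacking above would fail, so checking the length of the result first is probably safer.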

Upvotes: 1

Mark R. Wilkins

Reputation: 1302

Sorry for not directly answering your question, but I wonder if regular expressions are overcomplicating the problem? You could do something like this:

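with open('myfile', 'r') as f:
    # Skip lines up to and including the first one that starts with '---'.
    for line in f:
        if line.startswith('---'):
            break

    # Whatever is left in the file is the content (as a list of lines).
    text = f.readlines()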

Or, am I missing something?

I tend to find that regular expressions are difficult enough to maintain that if you don't need their unique capabilities for a given purpose it'll be cleaner and more readable to avoid using them entirely.
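As a rough sketch of the same line-by-line idea, the loop can be wrapped in a small function that accepts any iterable of lines. The name text_after_marker and its marker parameter are hypothetical, chosen only for this example, and io.StringIO stands in for a real file.

```python
import io


def text_after_marker(lines, marker="---"):
    """Return everything after the first line that starts with marker.

    Returns None if no such line exists.
    """
    it = iter(lines)
    for line in it:
        if line.startswith(marker):
            # The iterator now sits just past the marker line, so the
            # remaining lines are exactly the content we want.
            return "".join(it)
    return None


# io.StringIO behaves like an open text file for this purpose.
fake_file = io.StringIO("above\n---\ncontent\n---\nmore content\n")
print(text_after_marker(fake_file))
```

Because only the first matching line ends the loop, later lines of hyphens are returned as part of the content.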

Upvotes: 1

unutbu

Reputation: 880229

pat = re.compile(r'(?ms)^---(.*)\Z')

The (?ms) adds the MULTILINE and DOTALL flags.

The MULTILINE flag makes ^ match the beginning of lines (not just the beginning of the string.) We need this because the --- occurs at the beginning of a line, but not necessarily the beginning of the string.

The DOTALL flag makes . match any character, including newlines. We need this so that (.*) can match more than one line.

\Z matches the end of the string (as opposed to the end of a line).

For example,

import re

text = '''\
Anything above this first set of hyphens should not be captured.

---

This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
'''

pat = re.compile(r'(?ms)^---(.*)\Z')
print(re.search(pat, text).group(1).strip())

prints

This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
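One thing to watch for: re.search returns None when no line starts with ---, so calling .group(1) on the result would raise an AttributeError for such input. A minimal sketch of a wrapper that handles this case follows. The function name after_first_rule is made up for this example, and stripping the surrounding whitespace is a choice of the sketch, not a requirement.

```python
import re

PAT = re.compile(r'(?ms)^---(.*)\Z')


def after_first_rule(text):
    """Return the text after the first line-initial '---', or None."""
    m = PAT.search(text)
    if m is None:
        # No line in the text starts with three hyphens.
        return None
    # Drop the blank lines left around the captured block.
    return m.group(1).strip()


print(after_first_rule("top\n\n---\n\nbody\n---\nmore\n"))
print(after_first_rule("no marker - here"))
```

The second call prints None because the hyphen there is not at the start of a line.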

Note that when you define a regex character class with brackets, [...], the items inside the brackets are (apart from ranges like a-z) interpreted as single characters, not as patterns. So [---] is no different from [-]: it is the range of characters from - to -, inclusive.

The parentheses inside the character class are interpreted as literal parentheses too, not grouping delimiters. In [(---)], the leading (-- is parsed as the range from ( to -, which covers the characters ( ) * + , and -, and the remaining - and ) are literals. So [(---)] is equivalent to [()*+,-]; it is a set of single characters, not a pattern for three hyphens.

Thus the character class [^(---)]+ matches a run of characters containing no hyphen or parenthesis (nor *, + or ,):

In [23]: re.search('[^(---)]+', 'foo - bar').group()
Out[23]: 'foo '

In [24]: re.search('[^(---)]+', 'foo ( bar').group()
Out[24]: 'foo '

You can see where this is going, and why it does not work for your problem.
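As a quick check of the points above, the following sketch runs the question's pattern against a short made-up string. Some Python versions emit a FutureWarning about -- inside a character set, so the sketch silences it; that line is an addition of this example and not part of the pattern's behaviour.

```python
import re
import warnings

# Some Python versions warn about '--' inside a character set.
warnings.simplefilter("ignore", FutureWarning)

# [---] is the range from '-' to '-', so it matches one hyphen at a time.
single = re.findall(r'[---]', 'a-b')

# Hypothetical input with a hyphen after the '---' line.
text = "above\n---\nfirst part - second part"

# The question's pattern finds the run of allowed characters that reaches
# the end of the string, which starts right after the LAST hyphen.
tail = re.search(r'[^(---)]+$', text).group()

print(single)
print(repr(tail))
```

The match begins after the final hyphen rather than after the --- line, which is the behaviour described in the question.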

Upvotes: 1
