Reputation: 8275
So, I am trying to create a simple regex that matches the following string:
<PRE>><A HREF="../cgi-bin/hgTracks?hgsid=160564920&db=hg18&position=chrX:33267175-33267784&hgPcrResult=pack">chrX:33267175-33267784</A> 610bp TGATGTTTGGCGAGGAACTC GCAGAGTTTGAAGAGCTCGG
TGATGTTTGGCGAGGAACTCtactattgttacacttaggaaaataatcta
atccaaaggctttgcatctgtacagaagagcgagtagatactgaaagaga
tttgcagatccactgttttttaggcaggaagaatgctcgttaaatgcaaa
cgctgctctggctcatgtgtttgctccgaggtataggttttgttcgactg
acgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgca
taataaatgactgaaagaatcatgttaggcatgcccacctaacctaactt
gaatcatgcgaaaggggagctgttggaattcaaatagactttctggttcc
cagcagtcggcagtaatagaatgctttcaggaagatgacagaatcaggag
aaagatgctgttttgcactatcttgatttgttacagcagccaacttattg
gcatgatggagtgacaggaaaaacagctggcatggaaggtaggattatta
aagctattacatcattacaaatacaattagaagctggccatgacaaagca
tatgtttgaacaagcagctgttggtagctggggtttgttgCCGAGCTCTT
CAAACTCTGC
</PRE>
I have created the following regex:
<PRE>[.|[\n]]*</PRE>
yet it won't match the string above. Does anyone have a solution to this conundrum, and perhaps an explanation of why this doesn't work?
Sorry about the formatting of this question.
Upvotes: 1
Views: 228
Reputation: 69021
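The issue is that inside [] the . is a literal period, not a match-anything dot; the | is a literal pipe, not an or; and a nested [ is just a literal bracket, not the start of another character class. In other words, the non-backslash special symbols lose their specialness, and the first ] closes the class, so the pattern is really the class [.|[\n] followed by ]*.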
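To see how the original pattern is actually read, here is a small sketch. The test strings are made up for illustration; only the pattern comes from the question.

```python
import re

# The pattern from the question. Python reads it as the character class
# [.|[\n] (one of '.', '|', '[' or a newline) followed by ]* (zero or
# more literal ']' characters).
pattern = re.compile(r'[.|[\n]]*')

dot_only = pattern.fullmatch('.')            # one class character, no ']'
pipe_brackets = pattern.fullmatch('|]]')     # '|' followed by two ']'
plain_letter = pattern.fullmatch('a')        # 'a' is not in the class
no_special = pattern.search('<PRE>abc</PRE>')  # no '.', '|', '[' or newline

print(dot_only is not None, pipe_brackets is not None)
print(plain_letter, no_special)
```

So the pattern can only ever match a single character from the class plus trailing ] characters, which is why it cannot cover the whole PRE block.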
What you will want to do is this:
m = re.search(r'(<PRE>.*</PRE>)', input_string, re.DOTALL)
m.group(1)
.search() will look everywhere in the string for the match (.match() only checks the beginning of the string), and re.DOTALL (or re.S) will have the . match newlines as well.
If you don't want the <PRE> and </PRE> tags included, move the parentheses to surround the .* instead.
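A minimal runnable sketch of both variants, using a short made-up sample string rather than the full input from the question:

```python
import re

# Hypothetical sample: a PRE block that spans lines and is not at the
# very start of the string.
sample = "header\n<PRE>line one\nline two\n</PRE>\nfooter"

# Parentheses around everything: the tags are part of group 1.
with_tags = re.search(r'(<PRE>.*</PRE>)', sample, re.DOTALL)

# Parentheses around .* only: group 1 is just the content.
inner_only = re.search(r'<PRE>(.*)</PRE>', sample, re.DOTALL)

# Without re.DOTALL the dot stops at the first newline, so nothing matches.
no_flag = re.search(r'<PRE>(.*)</PRE>', sample)

# .match() only looks at the start of the string, which here is "header".
at_start = re.match(r'<PRE>(.*)</PRE>', sample, re.DOTALL)

print(repr(with_tags.group(1)))
print(repr(inner_only.group(1)))
print(no_flag, at_start)
```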
Upvotes: 0
Reputation: 24823
If you're going to parse HTML, please use lxml, as Hank proposed.
But for this regex to work, you need to change the [] to (). A | inside square brackets is interpreted as the symbol '|' and not as an OR operator.
Another option is to use the flag that's called DOTALL, which makes the dot operator match anything, including a newline. This way the regex becomes very simple:
m = re.match(r'<PRE>(.*)</PRE>', input_string, re.DOTALL)
m.group(1)
outputs the string inside the PRE, without the <PRE> and </PRE> tags themselves.
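As a rough comparison of the two options, the sketch below uses a made-up two-line sample. The grouped version is written with a non-capturing group, (?:.|\n), which is my own choice here so that group 1 stays the PRE content; plain parentheses would work too but shift the group numbers.

```python
import re

# Hypothetical sample that starts with <PRE>, so re.match() can be used.
sample = "<PRE>first\nsecond</PRE>"

# Option 1: parentheses instead of square brackets, so | really means "or".
grouped = re.match(r'<PRE>((?:.|\n)*)</PRE>', sample)

# Option 2: let the dot match newlines via the DOTALL flag.
dotall = re.match(r'<PRE>(.*)</PRE>', sample, re.DOTALL)

print(repr(grouped.group(1)))
print(repr(dotall.group(1)))
```

Both give the same content for this input; the DOTALL form is simply shorter to read.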
Upvotes: 1
Reputation: 71939
Stop trying to parse HTML using regexes. You can't do it (robustly). There's a reason there's this famous SO answer. Use lxml instead.
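For illustration only, here is what a parser-based approach could look like. Note that this sketch swaps in the standard library's html.parser instead of lxml, so that it runs without extra installs; the class name PreTextCollector, the helper pre_text, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class PreTextCollector(HTMLParser):
    """Collects the text found inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <pre> elements are currently open
        self.chunks = []  # text pieces seen while inside <pre>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <PRE> arrives as "pre".
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def pre_text(html):
    """Return the concatenated text inside all <pre> elements."""
    parser = PreTextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Shortened, made-up stand-in for the markup in the question.
html_doc = '<PRE><A HREF="x?a=1">chrX:1-2</A> 610bp\nacgt\n</PRE>'
print(repr(pre_text(html_doc)))
```

Unlike the regex, this also drops the nested link tag and keeps only its text.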
Upvotes: 2