newToProgramming

Reputation: 8275

Regex in Python

So, I am trying to create a simple regex that matches the following string:

<PRE>><A HREF="../cgi-bin/hgTracks?hgsid=160564920&db=hg18&position=chrX:33267175-33267784&hgPcrResult=pack">chrX:33267175-33267784</A> 610bp TGATGTTTGGCGAGGAACTC GCAGAGTTTGAAGAGCTCGG
TGATGTTTGGCGAGGAACTCtactattgttacacttaggaaaataatcta
atccaaaggctttgcatctgtacagaagagcgagtagatactgaaagaga
tttgcagatccactgttttttaggcaggaagaatgctcgttaaatgcaaa
cgctgctctggctcatgtgtttgctccgaggtataggttttgttcgactg
acgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgca
taataaatgactgaaagaatcatgttaggcatgcccacctaacctaactt
gaatcatgcgaaaggggagctgttggaattcaaatagactttctggttcc
cagcagtcggcagtaatagaatgctttcaggaagatgacagaatcaggag
aaagatgctgttttgcactatcttgatttgttacagcagccaacttattg
gcatgatggagtgacaggaaaaacagctggcatggaaggtaggattatta
aagctattacatcattacaaatacaattagaagctggccatgacaaagca
tatgtttgaacaagcagctgttggtagctggggtttgttgCCGAGCTCTT
CAAACTCTGC
</PRE>

I have created the following regex:

<PRE>[.|[\n]]*</PRE>

yet it won't match the string above. Does anyone have a solution to this conundrum, and perhaps an explanation of why this doesn't work?

Sorry about the formatting of this question.

Upvotes: 1

Views: 228

Answers (3)

Ethan Furman

Reputation: 69021

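The issue is that inside [] the . is a literal period, not a match-anything dot; the | is a literal pipe, not an OR; and the inner [ is a literal bracket, not the start of a nested character class. In other words, the special symbols (other than backslash escapes) lose their specialness. The first ] then closes the class, so the trailing ]* only matches literal ] characters.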
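To make that concrete, here is a small sketch that compiles just the bracket part of the pattern from the question and probes what it accepts. The test strings are made up for illustration.

```python
import re

# The bracket portion of the regex from the question.
pattern = re.compile(r'[.|[\n]]*')

# The class [.|[\n] matches exactly ONE of: ".", "|", "[", or a newline.
# The "]*" that follows matches zero or more literal "]" characters.
print(bool(pattern.fullmatch('.')))    # True: a literal period
print(bool(pattern.fullmatch('|]]')))  # True: a pipe, then two literal "]"
print(bool(pattern.fullmatch('[')))    # True: a literal "["
print(bool(pattern.fullmatch('a')))    # False: "." is not a wildcard here
print(bool(pattern.fullmatch('..')))   # False: the "*" applies to "]", not the class
```

So the original pattern can only match things like `<PRE>.]]</PRE>`, never arbitrary multi-line text.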

What you will want to do is this:

import re

m = re.search(r'(<PRE>.*</PRE>)', input_string, re.DOTALL)
m.group(1)  # the matched text, tags included

.search() will look everywhere in the string for the match (.match() only checks the beginning of the string), and re.DOTALL (or re.S) will have the . match newlines as well.

If you don't want the <PRE> and </PRE> tags included, move the parentheses to surround the .*.
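A runnable sketch of the points above, assuming a shortened, made-up stand-in for the page text (the `input_string` value here is hypothetical, not the real data from the question):

```python
import re

# Hypothetical, shortened stand-in for the page text in the question.
input_string = 'header\n<PRE>chrX:1-10 610bp\nTGATG\nCAAAC\n</PRE>\nfooter'

# Without re.DOTALL the dot stops at newlines, so the multi-line block is missed.
no_flag = re.search(r'<PRE>.*</PRE>', input_string)

# re.match() only tries position 0, so the leading "header\n" makes it fail.
anchored = re.match(r'<PRE>.*</PRE>', input_string, re.DOTALL)

# re.search() plus re.DOTALL finds the block anywhere in the string.
with_tags = re.search(r'(<PRE>.*</PRE>)', input_string, re.DOTALL).group(1)

# Parentheses moved around ".*" so that the tags are left out of the group.
inner = re.search(r'<PRE>(.*)</PRE>', input_string, re.DOTALL).group(1)

print(no_flag)    # None
print(anchored)   # None
print(repr(with_tags))
print(repr(inner))
```

Note that `.*` is greedy: if the input held several PRE blocks, this would capture from the first `<PRE>` to the last `</PRE>`.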

Upvotes: 0

Ofri Raviv

Reputation: 24823

If you're going to parse HTML, please use lxml, as Hank proposed.

But for this regex to work, you need to change the [] to (). A | inside square brackets is interpreted as the symbol '|' and not as an OR operator.

Another option is to use the flag that's called DOTALL, which makes the dot operator match anything, including a newline. This way the regex becomes very simple:

import re

m = re.match(r'<PRE>(.*)</PRE>', input_string, re.DOTALL)
m.group(1)

outputs the string inside the PRE, without the <PRE> and </PRE> tags themselves.
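As a rough comparison of the two options, the sketch below runs both on a made-up two-line input (the sample string is an assumption, and it starts with `<PRE>` because `re.match()` only matches at the beginning of the string). The alternation uses a non-capturing group, `(?:...)`, so that group 1 is still the text between the tags.

```python
import re

# Hypothetical sample; it must start with "<PRE>" for re.match() to succeed.
input_string = '<PRE>line one\nline two\n</PRE>'

# Option 1: brackets replaced by a group, "any non-newline char OR a newline".
m_group = re.match(r'<PRE>((?:.|\n)*)</PRE>', input_string)

# Option 2: plain dot, with re.DOTALL so that it also matches newlines.
m_dotall = re.match(r'<PRE>(.*)</PRE>', input_string, re.DOTALL)

print(repr(m_group.group(1)))
print(m_group.group(1) == m_dotall.group(1))  # True: both capture the same text
```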

Upvotes: 1

Hank Gay

Reputation: 71939

Stop trying to parse HTML using regexes. You can't do it (robustly). There's a reason there's this famous SO answer. Use lxml instead.
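For a feel of what the parser-based approach looks like, here is a minimal sketch. To keep it dependency-free it uses the standard library's `html.parser` as a stand-in rather than lxml, so it is not the lxml API; the `PreTextExtractor` class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Hypothetical helper: collect the text found inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # how many <pre> elements are currently open
        self.chunks = []  # text fragments seen while inside <pre>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <PRE> arrives as "pre".
        if tag == 'pre':
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == 'pre' and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


# Made-up, shortened markup in the spirit of the question's input.
html = '<PRE><A HREF="x?db=hg18">chrX:1-10</A> 610bp\ntgatg\n</PRE>'

parser = PreTextExtractor()
parser.feed(html)
parser.close()
pre_text = ''.join(parser.chunks)
print(repr(pre_text))
```

The link text and the sequence lines come out together, with the `<A>` markup already stripped, which is awkward to get right with a regex alone.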

Upvotes: 2
