Masoumeh Javanbakht

Reputation: 135

RegEx for matching HTML tags

I am trying to use a regular expression to extract the start tags in lines of given HTML code. In the following lines I expect to get only 'body' and 'h1' as start tags in the first line, and 'html', 'head' and 'title' as start tags in the second line:

'<body data-modal-target class=\'3\'><h1>Website</h1><br /></body></html>'
'<html><head><title>HTML Parser - II</title></head>'

I have already tried to do this using the following regular expression:

start_tags = re.findall(r'<(\w+)\s*.*?[^\/]>',line)

But my output for the first line is: ['body','h1','br'], while I do not expect to catch 'br' as I excluded '/'.

And for the second line it is ['html','title'], whereas I expect to catch 'head' too. I would be grateful if you could let me know which part of my code is wrong.

Upvotes: 0

Views: 8006

Answers (1)

Emma

Reputation: 27723

If you wish to do so with regular expressions, you might want to design multiple different expressions, step by step. You may be able to connect them using OR pipes, but it may not be necessary.
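As one possible starting point, here is a minimal sketch of a single expression for the start tags in the question. The pattern and the name START_TAG are my own assumptions, not a tested general solution: it takes the tag name, an optional run of attributes that cannot cross a '>', and a negative lookbehind so that a tag ending in '/>' is skipped. It will break on attribute values that contain '>'.

```python
import re

# Hypothetical pattern: tag name, optional attributes, and a closing '>'
# that is not immediately preceded by '/' (so '<br />' is skipped).
START_TAG = re.compile(r'<(\w+)(?:\s[^>]*)?(?<!/)>')

lines = [
    "<body data-modal-target class='3'><h1>Website</h1><br /></body></html>",
    "<html><head><title>HTML Parser - II</title></head>",
]

for line in lines:
    # findall returns only the captured tag names
    print(START_TAG.findall(line))
# ['body', 'h1']
# ['html', 'head', 'title']
```

The original attempt fails because .*? followed by [^\/]> is free to run across several tags until it finds any '>' whose preceding character is not '/', which is why 'head' is swallowed and 'br' is still reported.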

RegEx 1 for h1-h6 tags

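This expression helps you capture tags inside the body, such as h1-h6. Note that [^br] is a character class: it rejects any closing tag whose name starts with b or r (for example body and br), not only br: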

(<(.*)>(.*)</([^br][A-Za-z0-9]+)>)

You might want to add more boundaries to it. For example, you can replace (.*) with a more restrictive character class written in [].
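To see what the groups of this expression hold, here is a small sketch in Python. The name PAIR and the two sample strings are only for illustration; the expression itself is unchanged.

```python
import re

# The expression from above, unchanged.
PAIR = re.compile(r'(<(.*)>(.*)</([^br][A-Za-z0-9]+)>)')

match = PAIR.search('<h1>Website</h1>')
# group 2: opening tag content, group 3: text, group 4: closing tag name
print(match.group(2), match.group(3), match.group(4))  # h1 Website h1

# The closing tag name starts with 'b', so [^br] rejects it.
print(PAIR.search('<body>text</body>'))  # None
```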

RegEx Circuit

This link helps you to visualize your expressions:

RegEx 2 for head and body

For head and body tags, you might want to match across new lines, for which you might want an expression similar to:

(<head>([\s\S]*)<\/head>)|(<body>([\s\S]*)</body>)
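A rough sketch of how this expression could be used from Python follows. The sample document and the names SECTION and sections are assumptions made for the example. Note that the expression only matches a bare <body> tag, so a body tag with attributes would need the pattern to be extended.

```python
import re

# The expression from above, unchanged. [\s\S] also matches newlines.
SECTION = re.compile(r'(<head>([\s\S]*)<\/head>)|(<body>([\s\S]*)</body>)')

# Hypothetical multi-line input
html = """<html>
<head>
<title>HTML Parser - II</title>
</head>
<body>
<h1>Website</h1>
</body>
</html>"""

sections = {}
for m in SECTION.finditer(html):
    if m.group(1) is not None:
        # first alternative matched: group 2 is the head content
        sections['head'] = m.group(2).strip()
    else:
        # second alternative matched: group 4 is the body content
        sections['body'] = m.group(4).strip()

print(sections)
```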

Performance

These expressions are rather expensive. You might want to simplify them, write some other scripts to parse your HTML, or perhaps use an HTML parser instead.
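If you go the parser route, a minimal sketch with the html.parser module from the Python standard library might look like this. The class name StartTagCollector and the helper find_start_tags are made up for the example. The parser reports self-closing tags such as <br /> through handle_startendtag, so overriding that method to do nothing leaves them out of the result.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collects the names of start tags, ignoring self-closing ones."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # The default implementation calls handle_starttag; skip it so that
        # tags like <br /> are not collected.
        pass


def find_start_tags(line):
    # Hypothetical helper: use a fresh parser for each line
    parser = StartTagCollector()
    parser.feed(line)
    parser.close()
    return parser.start_tags


print(find_start_tags("<body data-modal-target class='3'><h1>Website</h1><br /></body></html>"))
print(find_start_tags("<html><head><title>HTML Parser - II</title></head>"))
```

For the two lines in the question this should give ['body', 'h1'] and ['html', 'head', 'title'].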

Upvotes: 3
