Reputation: 135
I am trying to use a regular expression to extract the start tags from lines of HTML code. In the following lines I expect to get only 'body' and 'h1' as start tags in the first line, and 'html', 'head' and 'title' as start tags in the second line:
'<body data-modal-target class=\'3\'><h1>Website</h1><br /></body></html>'
'<html><head><title>HTML Parser - II</title></head>'
I have already tried to do this using the following regular expression:
start_tags = re.findall(r'<(\w+)\s*.*?[^\/]>',line)
But my output for the first line is: ['body','h1','br'], while I do not expect to catch 'br' as I excluded '/'.
And for the second line it is ['html','title'], whereas I expect to catch 'head' too. I would be very grateful if you could let me know which part of my code is wrong.
Upvotes: 0
Views: 8006
Reputation: 27723
If you wish to do so with regular expressions, you might want to design multiple different expressions, step by step. You may be able to connect them using OR pipes, but it may not be necessary.
This expression may help you capture the tags inside the body, excluding body and head:
(<(.*)>(.*)</([^br][A-Za-z0-9]+)>)
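You might want to add more boundaries to it. For example, you can replace (.*) with character classes written in []. Note that [^br] is itself a character class: it rejects any closing tag whose name starts with b or r, not only br.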
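As a minimal sketch of that idea applied to the pattern from the question: in the original, .*? is free to run past a > into the following tags, and [^\/] has to consume a character of its own, which is why '<html><head>' is swallowed as one match and '<br /></body>' still matches. The variation below is an assumption of mine, not the asker's code. It replaces .*? with the character class [^<>]* so a match stays inside one tag, and uses a lookbehind (?<!/) to reject tags that end in '/>'. The name START_TAG is hypothetical.

```python
import re

# Hypothetical variation on the pattern from the question:
#   [^<>]*  keeps the match inside a single tag
#   (?<!/)  rejects tags whose last character before '>' is '/'
START_TAG = re.compile(r'<(\w+)(?:\s+[^<>]*)?(?<!/)>')

lines = [
    '<body data-modal-target class=\'3\'><h1>Website</h1><br /></body></html>',
    '<html><head><title>HTML Parser - II</title></head>',
]

# findall returns only the captured tag names because the pattern
# has exactly one capturing group.
results = [START_TAG.findall(line) for line in lines]
for tags in results:
    print(tags)
# ['body', 'h1']
# ['html', 'head', 'title']
```

This still assumes that attribute values never contain < or >, so it is a sketch for simple input like the lines above rather than a general HTML matcher.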
For head and body tags, you might want to match across newlines, for which you might use an expression similar to:
(<head>([\s\S]*)<\/head>)|(<body>([\s\S]*)</body>)
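A short sketch of how such an expression might be used from Python follows. The sample document is assembled from the lines in the question and is only an assumption about the real input. The pattern is a slight variation: it drops the outer groups and uses the lazy [\s\S]*? so each match stops at the first closing tag.

```python
import re

# Hypothetical multi-line document built from the question's sample lines.
html = """<html>
<head>
<title>HTML Parser - II</title>
</head>
<body>
<h1>Website</h1>
<br />
</body>
</html>"""

# [\s\S] matches any character including newlines, so no flag is needed.
pattern = re.compile(r'<head>([\s\S]*?)</head>|<body>([\s\S]*?)</body>')

sections = {}
for m in pattern.finditer(html):
    if m.group(1) is not None:
        # The first alternative matched: content of <head>.
        sections['head'] = m.group(1).strip()
    else:
        # The second alternative matched: content of <body>.
        sections['body'] = m.group(2).strip()

print(sections['head'])
print(sections['body'])
```

Note that a literal <body> in the pattern will not match a body tag that carries attributes, such as the one in the question's first line.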
These expressions are rather expensive. You might want to simplify them, write some other script to parse your HTML, or perhaps use an HTML parser instead.
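For the parser route, here is a minimal sketch using Python's standard html.parser module. The class name StartTagCollector and the helper start_tags are hypothetical names chosen for illustration. It relies on HTMLParser calling handle_startendtag for tags written as <br />, so overriding that method is enough to skip them.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collects start tag names, skipping self-closing tags such as <br />."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called for ordinary start tags like <body ...> or <h1>.
        self.tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for tags written as <br />. The default implementation
        # forwards to handle_starttag, so override it to ignore them.
        pass


def start_tags(line):
    """Return the start tag names found in one line of HTML."""
    parser = StartTagCollector()
    parser.feed(line)
    parser.close()
    return parser.tags


print(start_tags('<body data-modal-target class=\'3\'><h1>Website</h1><br /></body></html>'))
print(start_tags('<html><head><title>HTML Parser - II</title></head>'))
# ['body', 'h1']
# ['html', 'head', 'title']
```

A tag written as <br> without the slash would still be reported as a start tag, so void elements may need an explicit exclusion list if they appear in that form.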
Upvotes: 3