Reputation: 6140
I need to match all of these opening tags:
<p>
<a href="foo">
But not self-closing tags:
<br />
<hr class="foo" />
I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.
<([a-z]+) *[^/]*?>
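I believe it says:
Find a less-than sign, then
Find (and capture) a-z one or more times, then
Find zero or more spaces, then
Find any character zero or more times, except /, then
Find a greater-than sign.
Do I have that right? And more importantly, what do you think?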
Upvotes: 2314
Views: 3927454
Reputation: 807
While parsing arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use regexes for parsing a limited, known set of HTML.
If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's web site. This was a limited, one-time job.
Regexes worked just fine for me, and were very fast to set up.
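As a rough illustration of that kind of limited, one-time job, here is a minimal Python sketch. The markup, the class names and the helper extract_members are all made up for the example; this is not the Parliament site's real format. It only works because every row is assumed to have exactly the same shape.

```python
import re

# Hypothetical markup: one table row per member, always in this exact shape.
ROW = re.compile(
    r'<tr>\s*<td class="name">(?P<name>[^<]+)</td>\s*'
    r'<td class="party">(?P<party>[^<]+)</td>\s*'
    r'<td class="district">(?P<district>[^<]+)</td>\s*</tr>'
)

def extract_members(html):
    """Return (name, party, district) tuples for every row in the known format."""
    return [
        (m["name"].strip(), m["party"].strip(), m["district"].strip())
        for m in ROW.finditer(html)
    ]

sample = """
<tr><td class="name">Jane Citizen</td><td class="party">Example Party</td>
<td class="district">Sampleton</td></tr>
<tr><td class="name">John Doe</td><td class="party">Other Party</td>
<td class="district">Testville</td></tr>
"""
print(extract_members(sample))
```

If the page layout changes, the pattern silently stops matching, which is acceptable for a throwaway script and much less so for anything long-lived.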
Upvotes: 3594
Reputation: 4789
On the question of using regular expressions to parse (x)HTML, my answer to everyone who spoke about limits is: you have not been trained enough to wield this powerful weapon, since nobody here mentioned recursion.
A regular expression-agnostic colleague pointed me to this discussion, which is certainly not the first on the web about this old and hot topic.
After reading some posts, the first thing I did was look for the "?R" string in this thread. The second was to search for "recursion".
No, holy cow, no match found. Since nobody mentioned the main mechanism a parser is built on, I soon realized that nobody had gotten the point.
If an (x)HTML parser needs recursion, a regular expression parser without recursion is not enough for the purpose. It's a simple construct.
The black art of regular expressions is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)
Here's the magic pattern:
$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/s";
Just try it. It's written as a PHP string, and the "s" modifier makes the dot match newlines as well.
Here's a sample note on the PHP manual I wrote in January: Reference
(Take care. In that note I wrongly used the "m" modifier; it should be erased, notwithstanding it is discarded by the regular expression engine, since no ^ or $ anchoring was used).
Now, we could speak about the limits of this method from a more informed point of view:
Anyhow, it is only a regular expression pattern, but it opens up the possibility of developing a lot of powerful implementations.
I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performance is really great, both in execution time and in memory usage (nothing like other template engines which use the same syntax).
Upvotes: 74
Reputation: 8379
I like to parse HTML with regular expressions. I don't attempt to parse idiot HTML that is deliberately broken. This code is my main parser (Perl edition):
$_ = join "",<STDIN>; tr/\n\r \t/ /s; s/</\n</g; s/>/>\n/g; s/\n ?\n/\n/g;
s/^ ?\n//s; s/ $//s; print
It's called htmlsplit, and it splits the HTML into lines, with one tag or chunk of text on each line. The lines can then be processed further with other text tools and scripts, such as grep, sed, Perl, etc. I'm not even joking :) Enjoy.
It is simple enough to rejig my slurp-everything-first Perl script into a nice streaming thing, if you wish to process enormous web pages. But it's not really necessary.
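For readers who don't speak Perl, here is a rough Python port of htmlsplit. It is a sketch that follows the one-liner step by step; the function name is borrowed from the description above, and reading from a string rather than STDIN is my own simplification.

```python
import re

def htmlsplit(html):
    """Put each tag and each chunk of text on its own line."""
    text = re.sub(r"[\n\r \t]+", " ", html)              # tr/\n\r \t/ /s : squeeze whitespace
    text = text.replace("<", "\n<").replace(">", ">\n")  # newline before '<' and after '>'
    text = re.sub(r"\n ?\n", "\n", text)                 # drop empty or blank-only lines
    text = re.sub(r"^ ?\n", "", text)                    # drop a leading empty line
    text = re.sub(r" $", "", text)                       # drop a trailing space
    return text

print(htmlsplit("<p>Hello <b>world</b></p>\n"))
```

Like the original, it knows nothing about comments, scripts, or a > inside an attribute value; it just cuts at every angle bracket.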
Some better regular expressions:
/(<.*?>|[^<]+)\s*/g # Get tags and text
/(\w+)="(.*?)"/g # Get attributes
They are good for XML / XHTML.
With minor variations, it can cope with messy HTML... or convert the HTML -> XHTML first.
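A small Python sketch of how the two expressions above might be used together. The helper names tokens and attributes are made up for the example, and it assumes tidy XML/XHTML with double-quoted attribute values.

```python
import re

TOKEN = re.compile(r"(<.*?>|[^<]+)\s*")   # a tag, or a run of text
ATTR = re.compile(r'(\w+)="(.*?)"')       # name="value" pairs inside one tag

def tokens(markup):
    """Split markup into tags and text chunks."""
    return TOKEN.findall(markup)

def attributes(tag):
    """Return the double-quoted attributes of a single tag as a dict."""
    return dict(ATTR.findall(tag))

parts = tokens('<a href="x" id="y">hi</a>')
print(parts)
print(attributes(parts[0]))
```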
The best way to write regular expressions is in the Lex / Yacc style, not as opaque one-liners or commented multi-line monstrosities. I didn't do that here, yet; these ones barely need it.
Upvotes: 87
Reputation: 4789
Here's the solution:
<?php
// here's the pattern:
$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*(\/>|>)/';
// a string to parse:
$string = 'Hello, try clicking <a href="#paragraph">here</a>
<br/>and check out.<hr />
<h2>title</h2>
<a name ="paragraph" rel= "I\'m an anchor"></a>
Fine, <span title=\'highlight the "punch"\'>thanks<span>.
<div class = "clear"></div>
<br>';
// let's get the occurrences:
preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);
// print the result:
print_r($matches[0]);
?>
To test it deeply, I entered in the string auto-closing tags like:
I also entered tags with:
Should you find something that does not work in the proof of concept above, I am happy to analyze the code to improve my skills.
<EDIT> I forgot that the question from the user was to avoid the parsing of self-closing tags. In this case the pattern is simpler, turning into this:
$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*>/';
The user @ridgerunner noticed that the pattern does not allow unquoted attributes or attributes with no value. In this case a fine tuning brings us the following pattern:
$pattern = '/<(\w+)(\s+(\w+)(\s*\=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/';
</EDIT>
Small tip: to better analyze this code, look at the generated source code, since I did not escape any HTML special characters.
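For comparison, the last pattern from the edit can be tried outside PHP. The following is a Python sketch, not the author's code: OPEN_TAG and open_tags are names invented here, and the pattern is copied as-is apart from Python's quoting.

```python
import re

# Same pattern as the final PHP version: group 5 is the quote character
# (possibly empty) and \5 requires the same character to close the value.
OPEN_TAG = re.compile(r'''<(\w+)(\s+(\w+)(\s*=\s*('|"|)(.*?)\5\s*)?)*\s*>''')

def open_tags(html):
    """Return every opening tag matched by the pattern."""
    return [m.group(0) for m in OPEN_TAG.finditer(html)]

sample = '<a href="#p">here</a><br/><hr /><h2>t</h2><a name ="p" rel= "x y"></a><br>'
print(open_tags(sample))
```

On this sample the self-closing <br/> and <hr /> and the end tags are skipped. Other inputs, especially unquoted values that contain a slash, deserve their own tests before the pattern is trusted.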
Upvotes: 100
Reputation: 11452
If you're simply trying to find those tags (without ambitions of parsing) try this regular expression:
/<[^/]*?>/g
I wrote it in 30 seconds, and tested here: https://regexr.com/
It matches the types of tags you mentioned, while ignoring the types you said you wanted to ignore.
Upvotes: 58
Reputation: 161
To match open tags (start tags) except XHTML self-contained tags, you can use the following regular expression:
<[^/][^>]*>
< : Matches the opening angle bracket.
[^/] : Matches any character except the forward slash /, ensuring the tag is not a closing tag.
[^>]* : Matches zero or more characters, not the closing angle bracket >, allowing any attributes to be present.
> : Matches the closing angle bracket, completing the tag.
Upvotes: -4
Reputation: 229
<([a-z][^>\s]*)(?:\s+[^>]+)?>
This regular expression matches a tag whose name starts with a lowercase letter (e.g. <p>, <a>), continues with any characters other than > or whitespace, and is optionally followed by whitespace and attributes before the closing > character. Note that it does not exclude self-closing tags: because [^>] also accepts /, it still matches <br /> and <hr class="foo" />.
Upvotes: 2
Reputation: 16928
I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and a regular expression is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), you can't possibly make this work.
But many will try, and some will even claim success, until others find the fault and totally mess you up.
Upvotes: 2366
Reputation: 111940
There are people that will tell you that the Earth is round (or perhaps that the Earth is an oblate spheroid if they want to use strange words). They are lying.
There are people that will tell you that Regular Expressions shouldn't be recursive. They are limiting you. They need to subjugate you, and they do it by keeping you in ignorance.
You can live in their reality or take the red pill.
Like Lord Marshal (is he a relative of the Marshal .NET class?), I have seen the Underverse Stack Based Regex-Verse and returned with powers knowledge you can't imagine. Yes, I think there were an Old One or two protecting them, but they were watching football on the TV, so it wasn't difficult.
I think the XML case is quite simple. The RegEx (in the .NET syntax), deflated and coded in base64 to make it easier to comprehend by your feeble mind, should be something like this:
7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28
995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F
86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169
OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq
i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv
p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf
LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e
Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7
O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm
rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv
z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme
nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e
vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y
gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs
mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH
W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52
MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU
1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn
xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ
GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY
12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37
R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn
3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25
D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP
mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS
mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX
X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8
DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c
etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3
zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS
ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ
j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX
/ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d
mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u
v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj
4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq
GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6
mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K
MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z
0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26
7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29
7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9
r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va
j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd
w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa
2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm
AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C
j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8
fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+
+fz0KSu73/dtkGDYdvZgSP5uskadrtViRKyal2IKAiQfiW+FI+tET/9/Txj9SFf8SFf8rOuKzagx
+r/vD34mUADO1P4/AQAA//8=
The option to set is RegexOptions.ExplicitCapture. The capture group you are looking for is ELEMENTNAME. If the capture group ERROR is not empty then there was a parsing error and the Regex stopped.
If you have problems reconverting it to a human-readable regex, this should help:
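// Requires: using System; using System.IO; using System.IO.Compression; using System.Text;
static string FromBase64(string str)
{
    // Decode the base64 text back into the deflated bytes.
    byte[] byteArray = Convert.FromBase64String(str);
    using (var msIn = new MemoryStream(byteArray))
    using (var msOut = new MemoryStream()) {
        // Inflate the compressed stream into msOut.
        using (var ds = new DeflateStream(msIn, CompressionMode.Decompress)) {
            ds.CopyTo(msOut);
        }
        return Encoding.UTF8.GetString(msOut.ToArray());
    }
}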
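If you don't have a .NET toolchain at hand, the same decoding step can be sketched in Python. This assumes the blob is a raw DEFLATE stream (no zlib or gzip header), which is what DeflateStream is expected to produce; the function name from_base64 simply mirrors the C# helper.

```python
import base64
import zlib

def from_base64(data):
    """Base64-decode, then inflate a raw DEFLATE stream into UTF-8 text."""
    raw = base64.b64decode(data)
    # A negative wbits value tells zlib to expect raw DEFLATE with no header.
    return zlib.decompress(raw, -15).decode("utf-8")
```

Pass it the base64 lines above joined into one string; the result should be the human-readable regex.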
If you are unsure, no, I'm NOT kidding (but perhaps I'm lying). It WILL work. I've built tons of unit tests to test it, and I have even used (part of) the conformance tests. It's a tokenizer, not a full-blown parser, so it will only split the XML into its component tokens. It won't parse/integrate DTDs.
Oh... if you want the source code of the regex, with some auxiliary methods:
regex to tokenize an xml or the full plain regex
Upvotes: 604
Reputation: 42689
First, to answer the direct question: Your regex has a bug since it will exclude a tag with a slash anywhere, not just at the end. For example it would exclude this valid opening tag: <a href="foo/bar.html"> because it has a slash in an attribute value.
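The bug is easy to reproduce. Here is a quick Python check of the question's regex against the four tags from the question plus the problem case; the names QUESTION_RE and accepts are only for this illustration.

```python
import re

QUESTION_RE = re.compile(r"<([a-z]+) *[^/]*?>")

def accepts(tag):
    """True if the question's regex matches the whole tag."""
    return QUESTION_RE.fullmatch(tag) is not None

for tag in ['<p>', '<a href="foo">', '<br />', '<hr class="foo" />',
            '<a href="foo/bar.html">']:
    print(tag, accepts(tag))
```

The first two tags are accepted and the two self-closing tags are rejected, as intended, but the last tag is rejected as well even though it is a perfectly good opening tag.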
We can fix that, but more seriously, this regex will lead to false positives, because it will also match inside comments and CDATA sections, where the same characters don't represent a valid tag. For example:
<!-- <foo> -->
or
<![CDATA[ <foo> ]]>
HTML strings embedded in scripts are especially likely to trigger false positives, and so is the regular use of < and > as comparison operators in JavaScript. And of course so are sections of HTML that are commented out with <!-- -->.
So to match only actual tags, you also need to be able to skip past comments and CDATA sections. The regex therefore has to match comments and CDATA as well, but capture only the opening tags. This is still possible using a regexp, but it becomes significantly more complex, for example:
(
<!-- .*? --> # comment
| <!\[CDATA\[ .*? \]\]> # CData section
| < \w+ ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* /> # self-closing tag
| (?<tag> < \w+ ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* > ) # opening tag - captured
| </ \w+ \s* > # end tag
)
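The same idea can be tried in Python. This is a sketch that transliterates the expression above: the doubled quotes become single quotes, the named group uses Python's (?P<tag>...) spelling, and re.VERBOSE and re.DOTALL are the flags I am assuming the original relies on. The helper name opening_tags is made up.

```python
import re

TOKEN = re.compile(r"""
    <!-- .*? -->                                            # comment
  | <!\[CDATA\[ .*? \]\]>                                   # CDATA section
  | < \w+ (?: "[^"]*" | '[^']*' | [^>/'"] )* />             # self-closing tag
  | (?P<tag> < \w+ (?: "[^"]*" | '[^']*' | [^>/'"] )* > )   # opening tag, captured
  | </ \w+ \s* >                                            # end tag
""", re.VERBOSE | re.DOTALL)

def opening_tags(markup):
    """Return only the opening tags, skipping everything else that is matched."""
    return [m.group("tag") for m in TOKEN.finditer(markup) if m.group("tag")]

sample = '<!-- <foo> --><p><a href="foo/bar.html">x</a><br /><![CDATA[ <bar> ]]>'
print(opening_tags(sample))
```

Comments and CDATA sections are consumed as whole tokens, so the <foo> and <bar> inside them never get a chance to match as tags.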
And this is just for XHTML conforming to the HTML compatibility guidelines. If you want to handle arbitrary XHTML you should also handle processing instructions and internal DTDs, since they can also embed false positives. If you also want to handle HTML there are additional complexities like the <script> tag. And if you also want to handle invalid HTML it gets yet more complex.
Given the complexity, I would not recommend going down that road. Instead, look for an off-the-shelf (X)HTML parsing library which can solve your problem.
A parser typically uses regular expressions (or similar) under the hood to split the document into "tokens" (doctype, start tags, end tags, text content etc.). But someone else will have debugged and tested these regexes for you! Depending on the type of parser it may further build a tree structure of elements by matching start tags to end tags. This will almost certainly save you a lot of time.
The exact parser library to use depends on your language and platform and the task you are solving. If you need access to the actual tag-substrings (e.g. if you are writing a syntax highlighter for HTML) you need to use a SAX-style parser which exposes the syntax tokens directly.
If you are only performing the tag-matching in order to manually build a syntax tree of elements, then a DOM parser does this work for you. But a DOM parser does not expose the underlying tag syntax, so does not solve the exact problem you describe.
You should also consider whether you need to parse invalid HTML. This is a much more complex task, but on the wild web most of the HTML is actually invalid. Something like Python's html5lib can parse invalid HTML.
Upvotes: 7
Reputation: 536745
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain, the song of re̸gular expression parsing will extinguish the voices of mortal man from the sphere I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful the final snuf
fing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes he c̶̮omes he comes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e
not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
Have you tried using an XML parser instead?
Moderator's Note
This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.
Upvotes: 4396
Reputation: 1028
<\s*(\w+)[^/>]*>
The parts explained:
< : Starting character
\s* : It may have whitespace before the tag name (ugly, but possible).
(\w+) : Tags can contain letters and numbers (h1). Well, \w also matches '_', but it does not hurt I guess. If curious, use ([a-zA-Z0-9]+) instead.
[^/>]* : Anything except > and / until closing >
> : Closing >
And to the fellows, who underestimate regular expressions, saying they are only as powerful as regular languages:
a^n b a^n b a^n, which is not regular and not even context free, can be matched with ^(a+)b\1b\1$
Backreferencing FTW!
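A quick way to see the backreference at work, sketched in Python (the helper name is made up):

```python
import re

COPY3 = re.compile(r"^(a+)b\1b\1$")

def is_an_b_an_b_an(s):
    """True if s is three equal runs of 'a' separated by single 'b's."""
    return COPY3.match(s) is not None

for s in ["ababa", "aabaabaa", "aabaaba", "abaabaa"]:
    print(s, is_an_b_an_b_an(s))
```

The first two strings are accepted because all three runs have the same length; the last two are rejected because one run is shorter than the others.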
Upvotes: 67
Reputation:
RegEx match open tags except XHTML self-contained tags
All other tags (and content) are skipped.
This regex does that. If you need to match only specific open tags, make a list in an alternation (?:p|br|<whatever tags you want>) and replace the [\w:]+ construct in the appropriate place below.
<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>)(*SKIP)(*FAIL))|(?:[\w:]+\b(?=((?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)*)>)\2(?<!/))|(?:(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))(*SKIP)(*FAIL))>
https://regex101.com/r/uMvJn0/1
# Mix html/xml
# https://regex101.com/r/uMvJn0/1
<
(?:
# Invisible content gets failed
(?:
(?:
# Invisible content; end tag req'd
( # (1 start)
script
| style
| object
| embed
| applet
| noframes
| noscript
| noembed
) # (1 end)
(?:
\s+
(?>
" [\S\s]*? "
| ' [\S\s]*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
\s* >
)
[\S\s]*? </ \1 \s*
(?= > )
(*SKIP)(*FAIL)
)
|
# This is any open html tag we will match
(?:
[\w:]+ \b
(?=
( # (2 start)
(?:
" [\S\s]*? "
| ' [\S\s]*? '
| [^>]?
)*
) # (2 end)
>
)
\2
(?<! / )
)
|
# All other tags get failed
(?:
(?: /? [\w:]+ \s* /? )
| (?:
[\w:]+
\s+
(?:
" [\S\s]*? "
| ' [\S\s]*? '
| [^>]?
)+
\s* /?
)
| \? [\S\s]*? \?
| (?:
!
(?:
(?: DOCTYPE [\S\s]*? )
| (?: \[CDATA\[ [\S\s]*? \]\] )
| (?: -- [\S\s]*? -- )
| (?: ATTLIST [\S\s]*? )
| (?: ENTITY [\S\s]*? )
| (?: ELEMENT [\S\s]*? )
)
)
)
(*SKIP)(*FAIL)
)
>
Upvotes: 4
Reputation: 8560
If you only want the tag names, it should be possible to do this via a regular expression.
<([a-zA-Z]+)(?:[^>]*[^/] *)?>
should do what you need. But I think the solution of "moritz" is already fine. I didn't see it in the beginning.
For all downvoters: In some cases it just makes sense to use a regular expression, because it can be the easiest and quickest solution. I agree that in general you should not parse HTML with regular expressions.
But regular expressions can be a very powerful tool when you have a subset of HTML where you know the format and you just want to extract some values. I did that hundreds of times and almost always achieved what I wanted.
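For the record, here is the expression above applied to the tags from the question, as a small Python sketch (tag_names is a made-up helper):

```python
import re

OPEN_TAG_NAME = re.compile(r"<([a-zA-Z]+)(?:[^>]*[^/] *)?>")

def tag_names(markup):
    """Return the names of opening tags that are not self-closing."""
    return OPEN_TAG_NAME.findall(markup)

print(tag_names('<p><a href="foo"><br /><hr class="foo" /></a></p>'))
```

Only p and a come back; the two self-closing tags and both end tags are skipped. It still has the usual blind spots (comments, scripts, a > inside an attribute value).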
Upvotes: 43
Reputation: 67345
The OP doesn't seem to say what he needs to do with the tags. For example, does he need to extract inner text, or just examine the tags?
I'm firmly in the camp that says a regular expression is not the be-all, end-all text parser. I've written a large amount of text-parsing code including this code to parse HTML tags.
While it's true I'm not all that great with regular expressions, I consider regular expressions just too rigid and hard to maintain for this sort of parsing.
Upvotes: 40
Reputation: 11181
I think this might work
<[a-z][^<>]*(?:(?:[^/]\s*)|(?:\s*[^/]))>
And that could be tested here.
XML elements must follow these naming rules:
And the pattern I used is going to adhere to these rules.
Upvotes: 22
Reputation: 4431
Here's a PCRE regular expression for XML/XHTML, built from a simplified EBNF syntax definition:
/
(?(DEFINE)
(?<tag> (?&tagempty) | (?&tagopen) ((?&textnode) | (?&tag) | (?&comment))* (?&tagclose))
(?<tagunnested> (?&tagempty) | (?&tagopen) ((?&textnode) | (?&comment))* (?&tagclose))
(?<textnode> [^<>]+)
(?<comment> <!--([\s\S]*?)-->)
(?<tagopen> < (?&tagname) (?&attrlist)? (?&ws)* >)
(?<tagempty> < (?&tagname) (?&ws)* (?&attrlist)? (?&ws)* \/>)
(?<tagclose> <\/ (?&tagname) (?&ws)* >)
(?<attrlist> ((?&ws)+ (?&attr))+)
(?<attr> (?&attrunquoted) | (?&attrsinglequoted) | (?&attrdoublequoted) | (?&attrempty))
(?<attrempty> (?&attrname))
(?<attrunquoted> (?&attrname) (?&ws)* = (?&ws)* (?&attrunquotedvalue))
(?<attrsinglequoted> (?&attrname) (?&ws)* = (?&ws)* ' (?&attrsinglequotedvalue) ')
(?<attrdoublequoted> (?&attrname) (?&ws)* = (?&ws)* " (?&attrdoublequotedvalue) ")
(?<tagname> (?&alphabets) ((?&alphabets) | (?&digits))*)
(?<attrname>(?&alphabets)+((?&alphabets)|(?&digits)|[:-]) )
(?<attrunquotedvalue> [^\s"'=<>`]+)
(?<attrsinglequotedvalue> [^']+)
(?<attrdoublequotedvalue> [^"]+)
(?<alphabets> [a-zA-Z])
(?<digits> [0-9])
(?<ws> \s)
)
(?&tagopen)
/x
This illustrates how to build regular expressions for context-free grammars. You can match other parts of the definition by changing the match on the last line from (?&tagopen) to e.g. (?&tagunnested)
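Python's built-in re module has no (?(DEFINE)...) blocks, so the closest everyday equivalent is to compose named sub-patterns as strings. The sketch below does that for the non-recursive tagopen rule only. The constant names follow the grammar above, but ATTRNAME is my own simplification, and the recursive tag rule cannot be expressed this way.

```python
import re

# Named building blocks mirroring the rules above. ATTRNAME is simplified
# here (an assumption): a letter followed by letters, digits, ':' or '-'.
WS = r"\s"
TAGNAME = r"[a-zA-Z][a-zA-Z0-9]*"
ATTRNAME = r"[a-zA-Z][a-zA-Z0-9:-]*"
UNQUOTED_VALUE = r"""[^\s"'=<>`]+"""
EQUALS = rf"{WS}*={WS}*"

ATTR = "|".join([
    rf"{ATTRNAME}{EQUALS}{UNQUOTED_VALUE}",   # attrunquoted
    rf"{ATTRNAME}{EQUALS}'[^']+'",            # attrsinglequoted
    rf'{ATTRNAME}{EQUALS}"[^"]+"',            # attrdoublequoted
    ATTRNAME,                                 # attrempty
])
ATTRLIST = rf"(?:{WS}+(?:{ATTR}))+"
TAGOPEN = re.compile(rf"<{TAGNAME}(?:{ATTRLIST})?{WS}*>")

def find_open_tags(markup):
    """Return every substring that matches the tagopen rule."""
    return TAGOPEN.findall(markup)

print(find_open_tags('<p><a href="foo" class=x disabled><br /><hr class="foo" /></a>'))
```

Each rule stays readable on its own line, which is most of the benefit of the DEFINE style.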
For XML/XHTML the consensus is no!
Credits to nikic for supplying the idea.
Upvotes: 9
Reputation: 27354
I agree that the right tool to parse XML and especially HTML is a parser and not a regular expression engine. However, like others have pointed out, sometimes using a regex is quicker, easier, and gets the job done if you know the data format.
Microsoft actually has a section of Best Practices for Regular Expressions in the .NET Framework and specifically talks about Consider[ing] the Input Source.
Regular Expressions do have limitations, but have you considered the following?
The .NET framework is unique when it comes to regular expressions in that it supports Balancing Group Definitions.
For this reason, I believe you CAN parse XML using regular expressions. Note however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML). This is possible since the "Balancing Group Definition" will allow the regular expression engine to act as a PDA.
Quote from article 1 cited above:
.NET Regular Expression Engine
As described above properly balanced constructs cannot be described by a regular expression. However, the .NET regular expression engine provides a few constructs that allow balanced constructs to be recognized.
(?<group>) - pushes the captured result on the capture stack with the name group.
(?<-group>) - pops the top most capture with the name group off the capture stack.
(?(group)yes|no) - matches the yes part if there exists a group with the name group otherwise matches no part.
These constructs allow for a .NET regular expression to emulate a restricted PDA by essentially allowing simple versions of the stack operations: push, pop and empty. The simple operations are pretty much equivalent to increment, decrement and compare to zero respectively. This allows for the .NET regular expression engine to recognize a subset of the context-free languages, in particular the ones that only require a simple counter. This in turn allows for the non-traditional .NET regular expressions to recognize individual properly balanced constructs.
Consider the following regular expression:
(?=<ul\s+id="matchMe"\s+type="square"\s*>)
(?>
<!-- .*? --> |
<[^>]*/> |
(?<opentag><(?!/)[^>]*[^/]>) |
(?<-opentag></[^>]*[^/]>) |
[^<>]*
)*
(?(opentag)(?!))
Use the flags:
(?=<ul\s+id="matchMe"\s+type="square"\s*>) # match start with <ul id="matchMe"...
(?> # atomic group / don't backtrack (faster)
<!-- .*? --> | # match xml / html comment
<[^>]*/> | # self closing tag
(?<opentag><(?!/)[^>]*[^/]>) | # push opening xml tag
(?<-opentag></[^>]*[^/]>) | # pop closing xml tag
[^<>]* # something between tags
)* # match as many xml tags as possible
(?(opentag)(?!)) # ensure no 'opentag' groups are on stack
You can try this at A Better .NET Regular Expression Tester.
I used the sample source of:
<html>
<body>
<div>
<br />
<ul id="matchMe" type="square">
<li>stuff...</li>
<li>more stuff</li>
<li>
<div>
<span>still more</span>
<ul>
<li>Another &gt;ul&lt;, oh my!</li>
<li>...</li>
</ul>
</div>
</li>
</ul>
</div>
</body>
</html>
This found the match:
<ul id="matchMe" type="square">
<li>stuff...</li>
<li>more stuff</li>
<li>
<div>
<span>still more</span>
<ul>
<li>Another &gt;ul&lt;, oh my!</li>
<li>...</li>
</ul>
</div>
</li>
</ul>
although it actually came out like this:
<ul id="matchMe" type="square"> <li>stuff...</li> <li>more stuff</li> <li> <div> <span>still more</span> <ul> <li>Another &gt;ul&lt;, oh my!</li> <li>...</li> </ul> </div> </li> </ul>
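Engines without balancing groups can get the same "simple counter" behaviour by letting a regex do only the tokenizing and keeping the depth in ordinary code. The Python sketch below shows that swapped-in technique; it is not the .NET expression itself. The names TOKEN and balanced_element are invented for the example, and like the pattern above it assumes well-formed markup with no raw < or > in text or attribute values.

```python
import re

# One token per comment, self-closing tag, closing tag, or opening tag.
TOKEN = re.compile(r"<!--.*?-->|<[^>]*/>|</[^>]*>|<[^>]*>", re.DOTALL)

def balanced_element(html, start_pattern):
    """Return the element that begins at the first match of start_pattern,
    up to and including its matching close tag, or None if unbalanced."""
    start = re.search(start_pattern, html)
    if start is None:
        return None
    depth = 0
    for tok in TOKEN.finditer(html, start.start()):
        text = tok.group()
        if text.startswith("<!--") or text.endswith("/>"):
            continue                  # comments and self-closing tags: no change
        if text.startswith("</"):
            depth -= 1                # "pop"
            if depth == 0:
                return html[start.start():tok.end()]
        else:
            depth += 1                # "push"
    return None                       # input ended with tags still open

page = ('<div><br /><ul id="matchMe" type="square"><li>a</li>'
        '<li><ul><li>b</li></ul></li></ul></div>')
print(balanced_element(page, r'<ul\s+id="matchMe"'))
```

The counter plays the role of the opentag capture stack: it goes up on every opening tag, down on every closing tag, and the match ends when it returns to zero.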
Lastly, I really enjoyed Jeff Atwood's article: Parsing Html The Cthulhu Way. Funny enough, it cites the answer to this question that currently has over 4k votes.
Upvotes: 296
Reputation: 4822
In shell, you can parse HTML using sed:
Related (why you shouldn't use regex match):
Upvotes: 324
Reputation: 29352
Disclaimer: use a parser if you have the option. That said...
This is the regex I use (!) to match HTML tags:
<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>
It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.
I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:
<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>
or just combine if and if not.
To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.
Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...
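Python's built-in re module rejects the variable-length look-behind used above, so there the "combine if and if not" route is the practical one. A minimal sketch follows; the helper names are made up, and the extra check that drops end tags is my own addition, since the tag regex itself matches them too.

```python
import re

TAG = re.compile(r"""<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>""")
SELF_CLOSING = re.compile(r"/\s*>$")

def open_tags(html):
    """Match every tag, then keep those that are neither self-closing nor end tags."""
    return [
        t for t in TAG.findall(html)
        if not SELF_CLOSING.search(t) and not t.startswith("</")
    ]

sample = '<p><a href="a>b">x</a><br /><hr class="foo" /><a name="badgenerator"">'
print(open_tags(sample))
```

The > inside the quoted href value does not end the tag early, and the stray extra quote in the last tag is tolerated.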
Upvotes: 1186
Reputation: 27363
There are some nice regexes for replacing HTML with BBCode here. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.
For example:
$store =~ s/http:/http:\/\//gi;
$store =~ s/https:/https:\/\//gi;
$baseurl = $store;
if (!$query->param("ascii")) {
$html =~ s/\s\s+/\n/gi;
$html =~ s/<pre(.*?)>(.*?)<\/pre>/\[code]$2\[\/code]/sgmi;
}
$html =~ s/\n//gi;
$html =~ s/\r\r//gi;
$html =~ s/$baseurl//gi;
$html =~ s/<h[1-7](.*?)>(.*?)<\/h[1-7]>/\n\[b]$2\[\/b]\n/sgmi;
$html =~ s/<p>/\n\n/gi;
$html =~ s/<br(.*?)>/\n/gi;
$html =~ s/<textarea(.*?)>(.*?)<\/textarea>/\[code]$2\[\/code]/sgmi;
$html =~ s/<b>(.*?)<\/b>/\[b]$1\[\/b]/gi;
$html =~ s/<i>(.*?)<\/i>/\[i]$1\[\/i]/gi;
$html =~ s/<u>(.*?)<\/u>/\[u]$1\[\/u]/gi;
$html =~ s/<em>(.*?)<\/em>/\[i]$1\[\/i]/gi;
$html =~ s/<strong>(.*?)<\/strong>/\[b]$1\[\/b]/gi;
$html =~ s/<cite>(.*?)<\/cite>/\[i]$1\[\/i]/gi;
$html =~ s/<font color="(.*?)">(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<font color=(.*?)>(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<link(.*?)>//gi;
$html =~ s/<li(.*?)>(.*?)<\/li>/\[\*]$2/gi;
$html =~ s/<ul(.*?)>/\[list]/gi;
$html =~ s/<\/ul>/\[\/list]/gi;
$html =~ s/<div>/\n/gi;
$html =~ s/<\/div>/\n/gi;
$html =~ s/<td(.*?)>/ /gi;
$html =~ s/<tr(.*?)>/\n/gi;
$html =~ s/<img(.*?)src="(.*?)"(.*?)>/\[img]$baseurl\/$2\[\/img]/gi;
$html =~ s/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/\[url=$baseurl\/$2]$4\[\/url]/gi;
$html =~ s/\[url=$baseurl\/http:\/\/(.*?)](.*?)\[\/url]/\[url=http:\/\/$1]$2\[\/url]/gi;
$html =~ s/\[img]$baseurl\/http:\/\/(.*?)\[\/img]/\[img]http:\/\/$1\[\/img]/gi;
$html =~ s/<head>(.*?)<\/head>//sgmi;
$html =~ s/<object>(.*?)<\/object>//sgmi;
$html =~ s/<script(.*?)>(.*?)<\/script>//sgmi;
$html =~ s/<style(.*?)>(.*?)<\/style>//sgmi;
$html =~ s/<title>(.*?)<\/title>//sgmi;
$html =~ s/<!--(.*?)-->/\n/sgmi;
$html =~ s/\/\//\//gi;
$html =~ s/http:\//http:\/\//gi;
$html =~ s/https:\//https:\/\//gi;
$html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi;
$html =~ s/\r\r//gi;
$html =~ s/\[img]\//\[img]/gi;
$html =~ s/\[url=\//\[url=/gi;
Upvotes: 71
Reputation: 4580
I suggest using QueryPath for parsing XML and HTML in PHP. It's basically much the same syntax as jQuery, only it's on the server side.
Upvotes: 269
Reputation: 833
Whenever I need to quickly extract something from an HTML document, I use Tidy to convert it to XML and then use XPath or XSLT to get what I need. In your case, something like this:
//p/a[@href='foo']
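Once the document is well-formed XML, the query itself needs very little code. Here is a Python sketch using the standard library's ElementTree, which supports a subset of XPath. It assumes the Tidy step has already been done and that the elements carry no default namespace (real XHTML output usually declares one, which would have to be handled). The helper name find_links is made up.

```python
import xml.etree.ElementTree as ET

def find_links(xhtml, href):
    """Return the <a> elements directly inside a <p> whose href equals href."""
    root = ET.fromstring(xhtml)
    return root.findall(f".//p/a[@href='{href}']")

doc = ('<html><body><p><a href="foo">x</a><br /></p>'
       '<p><a href="bar">y</a></p></body></html>')
for link in find_links(doc, "foo"):
    print(link.get("href"), link.text)
```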
Upvotes: 94
Reputation: 5654
Sun Tzu, an ancient Chinese strategist, general, and philosopher, said:
It is said that if you know your enemies and know yourself, you can win a hundred battles without a single loss. If you only know yourself, but not your opponent, you may win or may lose. If you know neither yourself nor your enemy, you will always endanger yourself.
In this case your enemy is HTML and you are either yourself or regex. You might even be Perl with irregular regex. Know HTML. Know yourself.
I have composed a haiku describing the nature of HTML.
HTML has
complexity exceeding
regular language.
I have also composed a haiku describing the nature of regex in Perl.
The regex you seek
is defined within the phrase
<([a-zA-Z]+)(?:[^>]*[^/]*)?>
Upvotes: 203
Reputation: 1414
It's true that when programming it's usually best to use dedicated parsers and APIs instead of regular expressions when dealing with HTML, especially if accuracy is paramount (e.g., if your processing might have security implications). However, I don’t subscribe to a dogmatic view that XML-style markup should never be processed with regular expressions. There are cases when regular expressions are a great tool for the job, such as when making one-time edits in a text editor, fixing broken XML files, or dealing with file formats that look like but aren’t quite XML. There are some issues to be aware of, but they're not insurmountable or even necessarily relevant.
A simple regex like <([^>"']|"[^"]*"|'[^']*')*> is usually good enough, in cases such as those I just mentioned. It's a naive solution, all things considered, but it does correctly allow unencoded > symbols in attribute values. If you're looking for, e.g., a table tag, you could adapt it as </?table\b([^>"']|"[^"]*"|'[^']*')*>.
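As a quick illustration, here is how those two patterns behave in Python's re module. The sample markup is made up for the example; the patterns are the ones quoted above.

```python
import re

# The two patterns from the text, unchanged.
TAG = re.compile(r'''<([^>"']|"[^"]*"|'[^']*')*>''')
TABLE_TAG = re.compile(r'''</?table\b([^>"']|"[^"]*"|'[^']*')*>''')

# Hypothetical input with an unencoded > inside an attribute value.
html = '<table title="rows > 10"><tr><td>x</td></tr></table>'

# group(0) is the whole match; findall() would return only the capture group.
tags = [m.group(0) for m in TAG.finditer(html)]
table_tags = [m.group(0) for m in TABLE_TAG.finditer(html)]

print(tags)
print(table_tags)
```

The > inside the quoted title value does not end the first tag, and the \b in the second pattern keeps it from matching tr or td.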
Just to give a sense of what a more "advanced" HTML regex would look like, the following does a fairly respectable job of emulating real-world browser behavior and the HTML5 parsing algorithm:
</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)
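A small Python sketch of that pattern at work, on made-up input. It only shows what the regex matches; it does not demonstrate conformance with the HTML5 algorithm.

```python
import re

# The "advanced" pattern from the text, unchanged.
HTML_TAG = re.compile(
    r'''</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)'''
)

# Hypothetical input: a quoted value containing >, an unquoted value,
# and a self-closing tag written without a space.
html = '<a title="a > b" href=foo>x</a><br/>'

matches = [m.group(0) for m in HTML_TAG.finditer(html)]
names = [m.group(1) for m in HTML_TAG.finditer(html)]

print(matches)
print(names)
```

Group 1 holds the tag name, so the same pattern can be used both to locate tags and to filter them by name.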
The following matches a fairly strict definition of XML tags (although it doesn't account for the full set of Unicode characters allowed in XML names):
<(?:([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?|/([_:A-Z][-.:\w]*)\s*)>
Granted, these don't account for surrounding context and a few edge cases, but even such things could be dealt with if you really wanted to (e.g., by searching between the matches of another regex).
At the end of the day, use the most appropriate tool for the job, even in the cases when that tool happens to be a regex.
Upvotes: 53
Reputation: 29240
You want the first > not preceded by a /. Look here for details on how to do that. It's referred to as negative lookbehind.
However, a naïve implementation of that will end up matching <bar/></foo> in this example document:
<foo><bar/></foo>
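The pitfall can be reproduced in a few lines of Python. Both patterns below are illustrative assumptions, not taken from this answer: the first is one possible naïve lookbehind, the second restricts what may appear between the angle brackets so a match cannot run past the end of a tag.

```python
import re

doc = '<foo><bar/></foo>'

# Naive: lazily consume anything up to the first > that is not preceded by /.
naive = re.compile(r'<.*?(?<!/)>')

# Stricter variant (an assumption for illustration): no < or > inside a tag.
strict = re.compile(r'<[^<>]*(?<!/)>')

naive_matches = naive.findall(doc)
strict_matches = strict.findall(doc)

print(naive_matches)
print(strict_matches)
```

The naïve pattern skips the > that closes <bar/> and keeps going until the > of </foo>, which is exactly the overrun described above. The stricter variant drops <bar/> but still reports the closing tag, so further filtering would be needed to keep opening tags only.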
Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programmatically?
Upvotes: 141
Reputation: 14518
If you need this for PHP:
The PHP DOM functions won't work properly unless the input is properly formatted XML, no matter how much better their use is for the rest of mankind.
simplehtmldom is good, but I found it a bit buggy, and it is quite memory heavy [it will crash on large pages].
I have never used querypath, so can't comment on its usefulness.
Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful.
For Python and Java, similar links were posted.
For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.
Upvotes: 110
Reputation: 171
This may do:
<.*?[^/]>
Or without the ending tags:
<[^/].*?[^/]>
What's with the flame wars on HTML parsers? HTML parsers must parse (and rebuild!) the entire document before they can categorize your search. Regular expressions may be faster or more elegant in certain circumstances. My 2 cents...
Upvotes: 31
Reputation: 10174
Although regular expressions are not a suitable or effective tool for that purpose, they sometimes provide quick solutions for simple matching problems, and in my view it's not that horrible to use them for trivial tasks.
There is a definitive blog post about matching innermost HTML elements written by Steven Levithan.
Upvotes: 51
Reputation: 1
As many people have already pointed out, HTML is not a regular language, which can make it very difficult to parse. My solution to this is to turn it into well-formed XML using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written in Java, using the jtidy library to turn the HTML into XML and then Jaxen to run XPath queries against the result.
Upvotes: 63