Reputation: 1195
I have several HTML blocks on a page set up like:
<p class="something">
<a href="http://example.com/9999">text 1 2 3</a>
<a href="http://example.com/2346saasdf">text 3 4 5</a>
(9999)
<a href="http://example.com/sad3ws">text 5 6 7random</a>
</p>
I want to get the number in the parentheses, including the parentheses themselves. I have to admit I've never really used regex before: I've read about it and seen examples, but haven't used it myself. Anyway, I came up with this after a little looking around:
<p class="something">(.*?)</p>
That correctly gets the entire <p> block, but again, I just want the (9999) (with parentheses intact). I'm not really sure how to get it.
Assuming that other elements on the page could also have digits in parentheses (but they won't be included in this exact format), and that the HTML will remain valid and consistent, how can I get it?
I understand this is probably easy for someone who has used regex before, but for the solution, I'd appreciate a little detail on what each character captures so I can learn from it.
Upvotes: 1
Views: 537
Reputation: 391306
With most regex engines, parentheses group parts of the expression; they do not match literal parentheses in the input.
As such, this (which you say works, somewhat):
<p class="something">(.*?)</p>
                     ^   ^
                     |   |
                     +---+--- creates a group
Since this "works", you can extract the contents of that group, but that gives you everything inside the <p>, parentheses included.
I would try this:
<p class="something">\((.*?)\)</p>
                     ^^     ^^
                     |      |
                     +------+-- matches (...)
And then extract the contents of the first group.
Now, as for what each character means:
<p class="something">\((.*?)\)</p>
<p class="something">               match <p class="something">
                     \(             match (, without the \ it would be a group
                       (            create a group
                        .           match one character (usually not newlines)
                         *          ... repeated zero or more times
                          ?         ... in a non-greedy way
                           )        end the group
                            \)      match )
                              </p>  match </p>
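To try the pattern from JavaScript, here is a minimal sketch. The sample strings, the variable names, and the second pattern are illustrative assumptions, not part of the answer above. As written, the pattern only matches when the opening parenthesis directly follows the opening tag, so the sketch also shows one possible variant for markup like the question's, where links surround the parenthesized number.

```javascript
// Assumed sample input, copied from the question's markup.
const html = [
  '<p class="something">',
  '<a href="http://example.com/9999">text 1 2 3</a>',
  '<a href="http://example.com/2346saasdf">text 3 4 5</a>',
  '(9999)',
  '<a href="http://example.com/sad3ws">text 5 6 7random</a>',
  '</p>'
].join('\n');

// The pattern from the answer: "(" must directly follow the opening tag.
const strict = /<p class="something">\((.*?)\)<\/p>/;
const simple = '<p class="something">(9999)</p>'; // assumed simplified input
const strictMatch = simple.match(strict);
// Group 1 holds the text between the literal parentheses.
const inner = strictMatch ? strictMatch[1] : null;

// Hypothetical variant for the question's markup:
// [\s\S]*? lazily skips any characters, including newlines;
// (\(\d+\)) captures literal parentheses around one or more digits.
const loose = /<p class="something">[\s\S]*?(\(\d+\))[\s\S]*?<\/p>/;
const looseMatch = html.match(loose);
const withParens = looseMatch ? looseMatch[1] : null;

console.log(inner, withParens);
```

The variant does not stop at paragraph boundaries: if the first matching <p> held no parenthesized digits, the lazy skip would run on into later markup, so treat it as a starting point only.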
Upvotes: 1
Reputation: 77034
Don't use regex to parse HTML.
Instead, use an HTML parser, then simply read the text (non-tag) content within the desired <p> block.
jQuery is a pretty decent HTML parser, so you can get the desired text stored in a variable x using:
var x = $('p').clone().find('a').remove().end().text();
If you can't use jQuery for whatever reason, you can use plain JavaScript against the DOM:
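// Work on a copy of the first <p>; adjust the lookup if the target is not the first one.
var y = document.getElementsByTagName("p")[0].cloneNode(true);
var x = "";
// Iterate by index: for...in on a NodeList also visits non-node properties such as length.
for (var k = 0; k < y.childNodes.length; k++) {
    // nodeType 3 is a text node (Node.TEXT_NODE)
    if (y.childNodes[k].nodeType === 3) {
        x += y.childNodes[k].textContent;
    }
}
x = x.trim();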
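If the paragraph's text nodes could contain more than the parenthesized number, a small regex over the extracted text can isolate it. This is only a sketch: the helper name firstParenNumber and the stand-in string are assumptions for illustration, with the string playing the role of text read from the DOM.

```javascript
// Hypothetical helper: returns the first "(digits)" token in a string, or null.
function firstParenNumber(text) {
  // \( and \) match literal parentheses; \d+ matches one or more digits.
  const m = /\(\d+\)/.exec(text);
  return m ? m[0] : null;
}

// Stand-in for the text collected from the paragraph's text nodes.
const sampleText = "\n\n(9999)\n";
const token = firstParenNumber(sampleText);
console.log(token);
```

Running the regex over text that a parser has already extracted keeps the pattern short, since it no longer has to account for tags or attributes.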
Upvotes: 6
Reputation: 793
If you really want to use regex, the following pattern might work for you. The text between the two links ends up in the first capture group.
var re = /<\/a>\s*([^\s]+)\s*<a /ig;
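A minimal sketch of how this pattern could be applied, assuming the question's markup is available as a string (the html and found variables are illustrative, not from the answer):

```javascript
// Assumed sample input, copied from the question's markup.
const html = [
  '<p class="something">',
  '<a href="http://example.com/9999">text 1 2 3</a>',
  '<a href="http://example.com/2346saasdf">text 3 4 5</a>',
  '(9999)',
  '<a href="http://example.com/sad3ws">text 5 6 7random</a>',
  '</p>'
].join('\n');

const re = /<\/a>\s*([^\s]+)\s*<a /ig;

// With the g flag, exec() resumes from re.lastIndex on every call,
// so the loop walks through all matches in the string.
const found = [];
let m;
while ((m = re.exec(html)) !== null) {
  found.push(m[1]); // group 1: the whitespace-free text between two links
}

console.log(found);
```

The pattern is not tied to the <p class="something"> block, so on a full page it would also pick up any other whitespace-free text sitting between two links.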
Upvotes: 0