Reputation: 1195

Finding everything but anchor tags within a <p> tag with regex

I have several HTML blocks on a page set up like:

<p class="something">
    <a href="http://example.com/9999">text 1 2 3</a>
    <a href="http://example.com/2346saasdf">text 3 4 5</a>
    (9999)
    <a href="http://example.com/sad3ws">text 5 6 7random</a>
</p>

I want to get the number in the parentheses, along with the parentheses themselves. I have to admit I've never really used regex before: I've read about it and seen examples of it, but haven't used it myself. Anyway, I put this together after a little looking around:

<p class="something">(.*?)</p>

That correctly gets the entire <p> block, but again, I just want the (9999) (with parentheses intact). I'm not really sure how to get it.

Assuming that other elements on the page could also have digits in parentheses (but they won't be included in this exact format), and that the HTML will remain valid and consistent, how can I get it?

I understand this is probably easy for someone who has used regex before, but for the solution, I'd appreciate a little detail on what each character captures so I can learn from it.

Upvotes: 1

Answers (3)

Lasse V. Karlsen

Reputation: 391704

With most regex engines, parentheses group parts of the expression; they do not match literal parentheses in the input.
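
A quick way to see the difference is to run both forms. This is a minimal sketch in JavaScript (an assumption, since the question does not name a language), and the sample string is made up for illustration:

```javascript
// Sample input, made up for illustration.
const text = "see (9999) here";

// Unescaped parentheses create a capture group; they do not match "(" or ")".
const grouped = text.match(/(\d+)/);

// Escaped parentheses match the literal characters "(" and ")".
// The inner, unescaped pair still creates a group around the digits.
const literal = text.match(/\((\d+)\)/);

console.log(grouped[0]); // 9999
console.log(literal[0]); // (9999)
console.log(literal[1]); // 9999
```

Index 0 of the result is the whole match, and index 1 is the contents of the first group.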

As such, this (which you say works, somewhat):

<p class="something">(.*?)</p>
                     ^   ^
                     |   |
                     +---+--- creates a group

Since this "works", you can just extract the contents of that group, but that would give you the whole contents of the block, parentheses included.

I would try this:

<p class="something">\((.*?)\)</p>
                     ^^     ^^
                      |     |
                      +-----+-- matches (...)

And then extract the contents of the first group.
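
One caveat: this pattern expects the parenthesis to follow the opening tag directly, whereas in the sample block from the question anchor tags and newlines sit in between, so it would not match that sample as written. As a hedged sketch in JavaScript (both the language and the adapted pattern are assumptions, not something the question specifies), `[\s\S]*?` can be used to skip over the surrounding content, since it matches any character including newlines:

```javascript
// Sample block from the question.
const html = [
  '<p class="something">',
  '    <a href="http://example.com/9999">text 1 2 3</a>',
  '    <a href="http://example.com/2346saasdf">text 3 4 5</a>',
  '    (9999)',
  '    <a href="http://example.com/sad3ws">text 5 6 7random</a>',
  '</p>'
].join('\n');

// [\s\S]*? lazily skips anything (newlines included) up to the first "(digits)".
// The group wraps the escaped parentheses, so they are kept in the capture.
const pattern = /<p class="something">[\s\S]*?(\(\d+\))[\s\S]*?<\/p>/;

// Hypothetical helper: returns the parenthesized number, or null if nothing matches.
function extractParenthesized(input) {
  const match = pattern.exec(input);
  return match ? match[1] : null;
}

console.log(extractParenthesized(html)); // (9999)
```

This assumes every such block contains a parenthesized number; if one does not, the lazy skip can run past that block's closing tag and pick up a later one.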

Now, as for what each character means:

<p class="something">\((.*?)\)</p>

<p class="something">                 match <p class="something">
                     \(               match (, without the \ it would be a group
                       (              create a group
                        .             match one character (usually not newlines)
                         *            ... repeated zero or more times
                          ?           ... in a non-greedy way
                           )          end the group
                            \)        match )
                              </p>    match </p>

Upvotes: 1

Mark Elliot

Reputation: 77104

Don't use regex to parse HTML.

Instead, use an HTML parser, then simply read the text (non-tag) content within the desired <p> block.

jQuery is a pretty decent HTML parser, so you can get the desired text stored in a variable x using:

var x = $('p').clone().find('a').remove().end().text();

If you can't use jQuery to make your life easy for whatever reason, you can use plain JavaScript on the DOM:

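// Work on a copy of the first <p> and collect only its direct text nodes,
// which skips the <a> children.
var y = document.getElementsByTagName("p")[0].cloneNode(true);
var x = "";
// Index-based loop: for...in on a NodeList also visits non-node keys such as "length".
for (var k = 0; k < y.childNodes.length; k++) {
    if (y.childNodes[k].nodeType === Node.TEXT_NODE) { // nodeType 3
        x += y.childNodes[k].textContent;
    }
}
x = x.trim();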

Upvotes: 6

z1x2

Reputation: 793

If you really want to use Regex, the following pattern might work for you.

var re = /<\/a>\s*([^\s]+)\s*<a /ig;
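
A sketch of how this pattern could be applied. The `html` string reproduces the sample block from the question, and `matchAll` is used here because the pattern carries the `g` flag:

```javascript
// Sample block from the question.
const html = [
  '<p class="something">',
  '    <a href="http://example.com/9999">text 1 2 3</a>',
  '    <a href="http://example.com/2346saasdf">text 3 4 5</a>',
  '    (9999)',
  '    <a href="http://example.com/sad3ws">text 5 6 7random</a>',
  '</p>'
].join('\n');

const re = /<\/a>\s*([^\s]+)\s*<a /ig;

// Collect group 1 from every match: the run of non-whitespace characters
// sitting between a closing </a> and the next opening <a.
const found = Array.from(html.matchAll(re), m => m[1]);

console.log(found); // [ '(9999)' ]
```

Because the group is a single run of non-whitespace characters, text containing spaces between the anchors (for example "(99 99)") would not be matched.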

Upvotes: 0
