Cesco

Reputation: 3870

How to search text surrounded by double-quotes with RegEx?

I have a string with some HTML code in, for example:

This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>

I need to strip out the id attribute from every HTML tag, but I have zero experience with regular expressions, so I searched around the internet and came up with this pattern: [\s]+id=\".*\"

Unfortunately, it's not working as I expected. I was hoping the regular expression would match id=" followed by any characters, repeated any number of times, and ending at the nearest double quote. In this example I expected it to match id="c1-id-8" and id="c1-id-9". Instead, the pattern returned the substring id="c1-id-8">some</strong> <em id="c1-id-9": it finds the first occurrence of id=" and the last occurrence of a double quote character.

Could you tell me what is wrong in my pattern and how to fix it, please? Thank you very much

Upvotes: 7

Views: 16583

Answers (5)

nachito

Reputation: 7035

The quantifier .* in your regex is greedy (meaning it matches as much as it can). To match only the minimum required, you could use something like /\s+id=\"[^\"]*\"/. The brackets [] indicate a character class, which matches any single character listed inside the brackets. The caret ^ at the beginning of the character class is a negation, meaning it matches any character except those listed in the brackets.

An alternative would be to tell the .* quantifier to be lazy by changing it to .*? which will match as little as it can.
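To make the difference concrete, here is a minimal sketch that runs the greedy, negated-class, and lazy patterns against the string from the question. JavaScript is an assumption on my part (the question does not name a language), and the variable names are only illustrative.

```javascript
// Sample string taken from the question.
const html = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>';

// Greedy: .* runs on to the LAST double quote in the string.
const greedy = html.match(/\s+id=".*"/g);

// Negated class: [^"]* cannot cross a double quote, so each match
// ends at the quote that closes the attribute value.
const negated = html.match(/\s+id="[^"]*"/g);

// Lazy: .*? gives up as soon as the following quote can match.
const lazy = html.match(/\s+id=".*?"/g);

console.log(greedy);
console.log(negated);
console.log(lazy);

// Stripping the attributes with the negated-class version.
const stripped = html.replace(/\s+id="[^"]*"/g, '');
console.log(stripped);
```

On this input the negated-class and lazy patterns find the same two attributes, while the greedy one returns a single long match.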

Upvotes: 13

Kent

Reputation: 195049

An example with grep (the point is the expression, not the tool):

kent$  echo 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>'|grep -oP '(?<= id=")[^"]*(?=">)'
c1-id-8
c1-id-9
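The same lookaround expression can be tried outside grep. The sketch below assumes a JavaScript engine with lookbehind support (ES2018 or later); the variable names are illustrative. Note that the lookahead `(?=">)` only matches when id is the last attribute in the tag, as it is in the question's sample.

```javascript
// Sample string taken from the question.
const html = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>';

// (?<= id=")  lookbehind: the match must be preceded by ` id="`
// [^"]*       the attribute value itself
// (?=">)      lookahead: the match must be followed by `">`
// Lookarounds are zero-width, so only the value ends up in the match.
const ids = html.match(/(?<= id=")[^"]*(?=">)/g);

console.log(ids);
```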

Upvotes: 1

db48x

Reputation: 3168

A parser is the best solution in the general case, but parsers do take time to write. There are cases where writing one would take more time than it would save; perhaps this is such a case.

What you want is either a non-greedy match or a more precise match. /[\s]+id=\".*?\"/ will do the trick, but [\s]+id=\"[^"]*\" will be faster.

Note that a full regex that takes into account the possibility of escaped quote characters, allows single quotes instead of double quotes, and allows for the absence of quotes entirely would be much more complex. You would really want a parser at that point.
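As a rough illustration of how the pattern grows, here is a sketch in JavaScript (my assumption; the question names no language) that accepts double-quoted, single-quoted, and unquoted values. It is not a complete solution: it ignores escaped quotes, and it would also strip text that merely looks like an id attribute outside of a tag. The sample string is made up for the example.

```javascript
// Hypothetical input mixing the three quoting styles.
const mixed = 'a <b id=\'x\'>1</b> <i id="y">2</i> <u id=z>3</u>';

// Three alternatives for the value:
//   "[^"]*"    double-quoted
//   '[^']*'    single-quoted
//   [^\s>]+    unquoted: runs up to whitespace or the closing >
const idAttribute = /\s+id=(?:"[^"]*"|'[^']*'|[^\s>]+)/g;

const cleaned = mixed.replace(idAttribute, '');
console.log(cleaned);
```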

Upvotes: 1

Tim Pietzcker

Reputation: 336128

In .* the asterisk is a greedy quantifier and matches as many characters as it can, so it only stops at the last " it finds.

You can either use ".*?" to make it lazy, or (better IMO), use "[^"]*" to make the match explicit:

"      # match a quote
[^"]*  # match any number of characters except quotes
"      # match a quote

You might still need to escape the quotes if you're building the regex from a string; otherwise that's not necessary, since quotes are not special characters in a regex.
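For instance, in JavaScript (an assumed language, since the question does not name one) the two ways of writing the pattern look like this. The escaping in the second form is required by the string literal, not by the regex engine; the variable names are illustrative.

```javascript
// Sample string taken from the question.
const html = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>';

// Regex literal: the quotes need no escaping.
const literal = /\s+id="[^"]*"/g;

// Built from a double-quoted string: each " is escaped for the string,
// and the backslash in \s is doubled so the regex still receives \s.
const fromString = new RegExp("\\s+id=\"[^\"]*\"", "g");

const viaLiteral = html.replace(literal, '');
const viaString = html.replace(fromString, '');

console.log(viaLiteral);
console.log(viaString);
```

Both calls should produce the same stripped string, because the two patterns are identical once the string escapes are resolved.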

Upvotes: 4

Jason Gennaro

Reputation: 34855

If you know that your id is always 7 characters, you could do this.

/\sid=".{7}"/g

So..

var a = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>';

var b = a.replace(/\sid=".{7}"/g, '');

document.write(b);

Example: http://jsfiddle.net/jasongennaro/XPMze/

Check the inspector to see the ids removed.

Upvotes: 0
