slier

Reputation: 6750

Regex Enforcing match

OK, I have this regex:

^[\w\s]+=["']\w+['"]

Now the regex will match:

a href='google'

a href="google"

and also

a href='google"

How can I make the regex enforce matching quotes?
If the first quote is a single quote, how can I require the closing quote to be a single quote too, not a double quote?
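
To reproduce the behaviour, here is a minimal sketch in Python. The question doesn't name a language, so the `re` module and the names `PATTERN`, `samples` and `results` are assumptions for illustration only:

```python
import re

# The pattern from the question; the closing class ['"] accepts either quote.
PATTERN = re.compile(r"""^[\w\s]+=["']\w+['"]""")

samples = ["a href='google'", 'a href="google"', "a href='google\""]

# Map each sample to whether the pattern matches at the start of the string.
results = {s: bool(PATTERN.match(s)) for s in samples}
print(results)
```

All three samples match, including the one with mismatched quotes, which is the problem.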

Upvotes: 1

Views: 102

Answers (5)

michid

Reputation: 10824

Try this:

^[\w\s]+="\w+"|^[\w\s]+='\w+'
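
A quick way to check this, sketched in Python (the language is an assumption, and `is_match` is a hypothetical helper, not part of any library):

```python
import re

# Two full alternatives, one per quote style, exactly as written above.
PATTERN = re.compile(r"""^[\w\s]+="\w+"|^[\w\s]+='\w+'""")

def is_match(text):
    """Return True if the pattern matches at the start of text."""
    return PATTERN.match(text) is not None

print(is_match("a href='google'"))   # same quote on both sides
print(is_match('a href="google"'))   # same quote on both sides
print(is_match("a href='google\""))  # mismatched quotes
```

The mismatched case fails because neither alternative can open with one quote type and close with the other.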

Upvotes: 0

taw

Reputation: 18861

What exactly do you want to match? It sounds like you want to match:

  • word (tagname)
  • mandatory whitespace
  • word (attr name)
  • optional whitespace
  • =
  • optional whitespace
  • either single quoted or double quoted anything (attr value)

That would be: ^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")

This will allow matches like:

  • a href='' - empty attr
  • a href='Hello world' - spaces and other non-word characters in quoted part
  • a href="one 'n two" - quotes of different kind in quoted part
  • a href = 'google' - spaces on both sides of =

And disallow things like these that your original regexp allows:

  • a b c href='google' - extra words
  •  ='google' - only whitespace on the left (note the leading space)
  • href='google' - only attr on the left

It still doesn't sound exactly right - you're trying to match a tag with exactly one attribute?

With this regexp, the tag name will be in $1, the attr name in $2, and the attr value in either $3 or $4, the other being nil. Most languages distinguish a group that was not taken (nil) from a group that was taken but is empty (""), if you need that.
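
As a rough illustration of the group handling, here is the same pattern in Python, where a group that did not participate is `None` rather than nil. Python is an assumption here (the answer targets Perl and Ruby), and `TAG_ATTR` and `parse` are hypothetical names:

```python
import re

# Groups: 1 = tag name, 2 = attr name, 3 = single-quoted value, 4 = double-quoted value.
TAG_ATTR = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")""")

def parse(text):
    """Return (tag, attr, value), or None if text does not match."""
    m = TAG_ATTR.match(text)
    if m is None:
        return None
    tag, attr, single, double = m.groups()
    # Only one of the two value groups participated; the other is None.
    value = single if single is not None else double
    return tag, attr, value

print(parse("a href=''"))
print(parse('a href="one \'n two"'))
print(parse("a b c href='google'"))
```

Testing `single is not None` rather than truthiness matters, because an empty attribute value is a legitimate match.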

A regexp that always puts the attr value in the same group is messier if you want to allow single quotes inside a double-quoted attr value and vice versa, something like ^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3. Here (?!...) is a zero-width negative look-ahead, so (?:(?!\3).) means something like [^\3], except that the latter isn't supported.
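
A sketch of the look-ahead variant in Python, whose `re` module accepts a backreference inside a negative look-ahead. The choice of Python and the names `SAME_GROUP` and `parse_same_group` are assumptions:

```python
import re

# Group 3 captures the opening quote; group 4 takes any run of characters
# that are not that same quote; \3 then requires the identical closing quote.
SAME_GROUP = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3""")

def parse_same_group(text):
    """Return (tag, attr, quote, value), or None if text does not match."""
    m = SAME_GROUP.match(text)
    return m.groups() if m else None

print(parse_same_group('a href="one \'n two"'))
print(parse_same_group("a href='google\""))
```

The value always lands in group 4, whichever quote type was used.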

If you don't care about this, ^(\w+)\s+(\w+)\s*=\s*(['"])([^'"]*)\3 will do just fine (for both, $3 will be the quote type and $4 the attr value).
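
For comparison, a sketch of the simpler form in Python, assuming the value group is the negated class [^'"]* so that no quote of either kind is allowed inside the value (`SIMPLE` and `parse_simple` are hypothetical names):

```python
import re

# Group 3 = quote type, group 4 = value containing no quote of either kind.
SIMPLE = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(['"])([^'"]*)\3""")

def parse_simple(text):
    """Return (tag, attr, quote, value), or None if text does not match."""
    m = SIMPLE.match(text)
    return m.groups() if m else None

print(parse_simple("a href='google'"))
print(parse_simple('a href="one \'n two"'))  # embedded quote is rejected
```

The trade-off is that a value such as "one 'n two" no longer matches, which is what not caring about embedded quotes means here.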

By the way, regarding (["'])\w+?\1 in another answer: \w doesn't match quotes, so the ? doesn't change anything.

Having said all that, use a real HTML parser ;-)
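
As one possible illustration of the parser route, here is a sketch using Python's standard-library `html.parser`. It assumes the input is real HTML markup with angle brackets (unlike the bare strings in the question), and `AttrCollector` and `collect_attrs` are hypothetical names:

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Collects (tag, attribute name, attribute value) triples from start tags."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        for name, value in attrs:
            self.found.append((tag, name, value))

def collect_attrs(html):
    """Feed html to the parser and return every attribute it saw."""
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.found

print(collect_attrs("<a href='google'>x</a><a href=\"one 'n two\">y</a>"))
```

The parser deals with quoting itself, so there is no quote-matching logic to maintain.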

These regexps will work in Perl and Ruby. Other languages usually copy Perl's regexp system, but often introduce minor changes so some adjustments might be necessary. Especially the one with negative look-aheads might be unsupported.

Upvotes: 0

Katriel

Reputation: 123762

Read about backreferences.

^[\w\s]+=(["'])\w+?\1

Note the ? after the second +, which makes it non-greedy (strictly optional here, since \w can never match a quote character). However, in general this is not the right way to parse HTML. Use an HTML parser such as Beautiful Soup.
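
A small sketch of the backreference in Python (the language is an assumption, and `LAZY`, `GREEDY` and `matches` are hypothetical names). It also runs the greedy form next to the lazy one to compare them on these inputs:

```python
import re

# \1 must repeat whatever group 1 captured, i.e. the opening quote.
LAZY = re.compile(r"""^[\w\s]+=(["'])\w+?\1""")
GREEDY = re.compile(r"""^[\w\s]+=(["'])\w+\1""")

def matches(pattern, text):
    """Return True if pattern matches at the start of text."""
    return pattern.match(text) is not None

samples = ["a href='google'", 'a href="google"', "a href='google\""]
for s in samples:
    print(s, matches(LAZY, s), matches(GREEDY, s))
```

Both forms accept the first two samples and reject the mismatched one.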

Upvotes: 6

AllenG

Reputation: 8190

Wrap the opening ["'] in a capture group and replace the closing ['"] with the back reference \1:

^[\w\s]+=(["'])\w+\1

Upvotes: 0

ternaryOperator

Reputation: 833

I am afraid you will have to do it the long way:

^[\w\s]+=("\w+"|'\w+')

More technically, ensuring correct matching / nesting of quotes in general is not something a regular grammar can express, so for more complex problems you would have to use a proper parser (or Perl 6 style extended regular expressions, although those technically do not count as regular expressions).

Upvotes: 0
