Jordan

Reputation: 3996

RegExp: get attributes but not the tag name

I'm trying to extract attributes from a JavaScript string with a RegExp, but I have one last problem.

I can get attributes with or without values, and I can get attributes even when the space between them is missing, but my RegExp also captures the tag name as an attribute.

Live example: http://regex101.com/r/zX5dJ7/3

The RegExp: (\s*\w+(?:=\"[^\"]*(?:\")?)?)

Example HTML: <div name="value"otherattribute foo="bar/>

Is there a way to make the RegExp skip the tag name?

EDIT:

If the HTML is this:

<meta charset="utf-8" alone foo="tab"/> <meta charset2="utf-8"foo2="tab"/> <meta charset3="utf-8"alone2 foo3="tab unclosed/>

I want to capture every attribute, like this:

  1. charset="utf-8",
  2. alone,
  3. foo="tab",
  4. charset2="utf-8",
  5. foo2="tab",
  6. charset3="utf-8",
  7. alone2,
  8. foo3="tab unclosed/>

My RegExp above works well, but it also captures the tag name. I just want the RegExp to skip the tag name.

Upvotes: 1

Views: 109

Answers (3)

Krisztián Balla

Reputation: 20361

This is the best I can come up with:

([<\w\-]+(?:=)?(?:"|')?[\w\-]+(?:"|')?)

You will have to skip matches that begin with < after using the regex.

DEMO: http://regex101.com/r/aL1sQ0/1
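The skipping step might look like the following in JavaScript. This is only a sketch: it assumes the pattern above is run with the global flag, and the helper name attributesWithoutTag is hypothetical, not part of the answer.

```javascript
// Pattern from the answer; the g flag is an assumption added to collect every match.
const BALLA_RE = /([<\w\-]+(?:=)?(?:"|')?[\w\-]+(?:"|')?)/g;

// Hypothetical helper: run the pattern, then drop every match that starts with "<".
function attributesWithoutTag(html) {
  const matches = html.match(BALLA_RE) || []; // match() returns null when nothing matches
  return matches.filter((m) => !m.startsWith('<'));
}

console.log(attributesWithoutTag('<div name="value"otherattribute foo="bar/>'));
// → ['name="value"', 'otherattribute', 'foo="bar']
```

On this input the unclosed value comes back as foo="bar, without the trailing /> that the question lists, because the value part of the pattern only accepts word characters and hyphens.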

Edit: Final solution by Jordan himself: (?:<\w+)?(\s*\w+(?:=\"[^\"]*(?:\")?)?)?
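One way to apply this final pattern in JavaScript is sketched below. It assumes the global flag and String.prototype.matchAll (ES2020); the function name extractAttributes is hypothetical. Because every part of the pattern is optional, it also produces matches where the capture group is empty, so those are skipped.

```javascript
// Final pattern from the answer (the redundant \" escapes are written as plain ").
// The g flag is an assumption added here so that matchAll can be used.
const JORDAN_FINAL_RE = /(?:<\w+)?(\s*\w+(?:="[^"]*(?:")?)?)?/g;

// Hypothetical helper: collect capture group 1 of every match.
function extractAttributes(html) {
  const attributes = [];
  for (const match of html.matchAll(JORDAN_FINAL_RE)) {
    // Group 1 is undefined when only a tag name, or nothing at all, was matched.
    if (match[1] !== undefined) {
      attributes.push(match[1].trim()); // drop the leading whitespace the group may include
    }
  }
  return attributes;
}

const sample =
  '<meta charset="utf-8" alone foo="tab"/> ' +
  '<meta charset2="utf-8"foo2="tab"/> ' +
  '<meta charset3="utf-8"alone2 foo3="tab unclosed/>';
console.log(extractAttributes(sample));
```

On the second sample from the question this returns the eight attributes listed there, ending with foo3="tab unclosed/>. The tag name is consumed by the non-capturing prefix (?:<\w+)?, which is why it never appears in group 1.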

Upvotes: 1

mechalynx

Reputation: 1302

Assuming properly formatted HTML (see my comment on the question for why we should assume well-formed HTML), this regex will match everything you want. It also matches the tag name together with its leading "<", so you can easily tell what is a tag and what isn't, and discard the tag:

(\w+(=\".*?\"|)|<\w+)


Parsing randomly malformed HTML is really NOT a job for regex. I cite here the countless cries of pain of many a regexper who has been asked "How can I parse HTML with regular expressions?". Search Stack Overflow for such questions and see what people answer. You'll see exactly why we should assume non-malformed HTML.

As stated above, after you get your matches and put them in an array or something, you can check for any string that starts with "<" and you'll know it's a tag. The remaining matches are the attributes, captured along with their values, so no worries there.
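The check described above could be sketched like this in JavaScript, assuming well-formed input and the global flag. The function name splitTagAndAttributes and the shape of the returned object are hypothetical choices for illustration.

```javascript
// Pattern from this answer; the g flag is an assumption added to collect every match.
const MECHALYNX_RE = /(\w+(=".*?"|)|<\w+)/g;

// Hypothetical helper: separate matches starting with "<" (tags) from the rest (attributes).
function splitTagAndAttributes(html) {
  const tokens = html.match(MECHALYNX_RE) || []; // match() returns null when nothing matches
  return {
    tagNames: tokens.filter((t) => t.startsWith('<')).map((t) => t.slice(1)),
    attributes: tokens.filter((t) => !t.startsWith('<')),
  };
}

console.log(splitTagAndAttributes('<meta charset="utf-8" alone foo="tab"/>'));
// → { tagNames: ['meta'], attributes: ['charset="utf-8"', 'alone', 'foo="tab"'] }
```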

Upvotes: 0

trainoasis

Reputation: 6720

If you want to get everything between a tag's name and its closing "/>", you could use

(?:<\w*)(.*)\/> 

Then you can extract whatever you want from the captured group. If you need further info, let me know.
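A possible two-step version in JavaScript is sketched below: first capture the part between the tag name and "/>", then run the question's original attribute pattern on that part only. Combining the two patterns this way is an assumption, not something the answer spells out, and the function name attributesOfSingleTag is hypothetical.

```javascript
// Step 1: pattern from this answer; group 1 is everything between the tag name and "/>".
const TAG_BODY_RE = /(?:<\w*)(.*)\/>/;
// Step 2: the question's original attribute pattern, with the g flag added (assumption).
const QUESTION_ATTR_RE = /(\s*\w+(?:="[^"]*(?:")?)?)/g;

// Hypothetical helper for a string that contains a single self-closing tag.
function attributesOfSingleTag(tag) {
  const body = tag.match(TAG_BODY_RE);
  if (body === null) {
    return []; // no "<name ... />" found
  }
  const matches = body[1].match(QUESTION_ATTR_RE) || [];
  return matches.map((m) => m.trim());
}

console.log(attributesOfSingleTag('<meta charset="utf-8" alone foo="tab"/>'));
// → ['charset="utf-8"', 'alone', 'foo="tab"']
```

Because (.*) is greedy, on a string that contains several tags the captured group runs up to the last "/>", so this sketch is only meant for one tag at a time.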

Upvotes: 1
