wroe12

Reputation: 179

Regex for no space between attributes html

How can I detect a missing space between attributes? Example:

 <div style="margin:37px;"/></div>
 <span title=''style="margin:37px;" /></span>
 <span title="" style="margin:37px;" /></span>
 <a title="u" hghghgh  title="j" >

 <a title=""gg  ff>

Lines 1, 3 and 4 are correct; lines 2 and 5 are incorrect. How can I detect the incorrect ones?

I've tried with this:

<(.*?=(['"]).*?\2)([\S].*)|(^/)>

But it's not working.

Upvotes: 5

Views: 524

Answers (4)

sina

Reputation: 980

You should not use regex to parse HTML, unless it is for learning purposes.


http://regexr.com/3cge1

<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?>

This regular expression matches even when the tag has no attributes at all. It also handles self-closing tags and attributes that have no value.


  • <\w+ Match opening < and \w characters.

  • (\s+[\w-]+(=(['"]?)[^"']*\3)?)* zero or more attributes, each of which must start with white space. It contains two parts:

    • \s+[\w-]+ attribute name after mandatory white space
    • (=(['"]?)[^"']*\3)? optional attribute value; \3 requires the closing quote to be the same as the opening one
  • \s*/?> optional white space and optional / followed by closing >.
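The breakdown above can be mirrored in code by assembling the pattern from its pieces. This is only a sketch: the names tagOpen, attrName, attrValue, tagClose and validTag are made up for illustration, and the pieces are copied from the expression above.

```javascript
// Pieces of the pattern, named for illustration only (hypothetical names).
const tagOpen = String.raw`<\w+`;                  // "<" followed by the tag name
const attrName = String.raw`\s+[\w-]+`;            // mandatory white space, then the attribute name
const attrValue = String.raw`(=(['"]?)[^"']*\3)?`; // optional value; \3 repeats the opening quote
const tagClose = String.raw`\s*/?>`;               // optional white space, optional "/", then ">"

// Group 1 wraps one attribute, group 2 its value, group 3 the quote character,
// so the \3 backreference still points at the quote after assembly.
const validTag = new RegExp(`${tagOpen}(${attrName}${attrValue})*${tagClose}`);

console.log(validTag.test('<a title="" href="#">')); // true: attributes are separated by a space
console.log(validTag.test('<a title=""href="#">'));  // false: no space after the first value
```

Keeping the three capturing groups in the same order as the original expression is what lets \3 keep its meaning.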


Here is a test for the strings:

var re = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;

! '<div style="margin:37px;"/></div>'.match(re);
false

! '<span title=\'\'style="margin:37px;" /></span>'.match(re);
true

! '<span title="" style="margin:37px;" /></span>'.match(re);
false

! '<a title="u" hghghgh  title="j" >'.match(re);
false

! '<a title=""gg  ff>'.match(re);
true

Display all incorrect tags:

var html = '<div style="margin:37px;"></div> <span title=\'\'style="margin:37px;"/><a title=""gg ff/> <span title="" style="margin:37px;" /></span> <a title="u" hghghgh title="j"example> <a title=""gg ff>';
var tagRegex = /<\w+[^>]*\/?>/g;
var validRegex = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;

html.match(tagRegex).forEach(function(m) {
  if(!m.match(validRegex)) {
    console.log('Incorrect', m);
  }
});

Will output

Incorrect <span title=''style="margin:37px;"/>
Incorrect <a title=""gg ff/>
Incorrect <a title="u" hghghgh title="j"example>
Incorrect <a title=""gg ff>

Update in response to the comments (each quote style is matched separately, and unquoted values are allowed):

<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>
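A minimal sketch of how the updated expression could be applied to the lines from the question. The helper name findBadTags, the sampleHtml variable and the extra unquoted `<input>` line are assumptions added for illustration; the ^ and $ anchors are also an addition, so that a tag only passes when the whole tag matches. It assumes no attribute value contains a ">" character.

```javascript
// Updated pattern from above, anchored with ^ and $ (an addition in this sketch)
// so that a tag only counts as valid when the whole tag matches.
const validTag = /^<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*\/?>$/;

// Loose pattern for opening tags, as used earlier in this answer.
const anyTag = /<\w+[^>]*\/?>/g;

// Hypothetical helper: returns every opening tag that fails the strict pattern.
function findBadTags(html) {
  const tags = html.match(anyTag) || [];
  return tags.filter((tag) => !validTag.test(tag));
}

const sampleHtml = [
  '<div style="margin:37px;"/></div>',
  "<span title=''style=\"margin:37px;\" /></span>",
  '<span title="" style="margin:37px;" /></span>',
  '<a title="u" hghghgh  title="j" >',
  '<a title=""gg  ff>',
  '<input type=text disabled>', // unquoted value, accepted by the updated pattern
].join('\n');

console.log(findBadTags(sampleHtml));
```

On this input only the opening tags of the second and fifth lines are reported.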

Upvotes: 3

Mi-Creativity

Reputation: 9654

I'm not sure about this, as I am not very experienced with regex, but this looks like it is working well:

JS Fiddle

<([a-z]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>([^<]+(<\/\1>))?

Currently <([a-z]+) will mostly work, but for web components and <ng-* tags [\w-]+ would be better, since \w on its own does not match the hyphen.

---------------

Output:

<div style="margin:37px;">div</div> correct

<span title=" style="margin:37px;" />span1</span> incorrect

<span title="" style="margin:37px;" />span2</span> correct

<a title="u" title="j">link</a> correct

<a title=""href="" alt="" required>test</a> incorrect

<img src="" data-abc="" required> correct

<input type=""style="" /> incorrect
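A minimal sketch of how results like these could be reproduced, assuming the check is simply whether the pattern finds a match in the line (the JS Fiddle code is not shown here, so this is a guess at its logic). It uses \1 as the backreference to the tag name, a subset of the inputs listed above, and a hypothetical helper named classify.

```javascript
// Pattern from this answer, with \1 as the backreference to the tag name.
const tagPattern = /<([a-z]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>([^<]+(<\/\1>))?/;

const samples = [
  '<div style="margin:37px;">div</div>',
  '<span title="" style="margin:37px;" />span2</span>',
  '<a title="u" title="j">link</a>',
  '<a title=""href="" alt="" required>test</a>',
  '<img src="" data-abc="" required>',
  '<input type=""style="" />',
];

// Hypothetical helper: a line counts as "correct" when the pattern finds a match.
function classify(line) {
  return tagPattern.test(line) ? 'correct' : 'incorrect';
}

for (const line of samples) {
  console.log(line, classify(line));
}
```

Note that this pattern only accepts double-quoted values, so single-quoted or unquoted attributes would be reported as incorrect.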

Upvotes: 1

Khan

Reputation: 41

Try this regex; I think it will work:

<\w*[^=]*=["'][\w;:]*["'][\s/]+[^>]*>

< - starting bracket

\w* - zero or more word characters (the tag name)

[^=]*= - matches every character up to and including the first '='

["'][\w;:]*["'] - matches the attribute value in two cases: 1. a single-quoted value, which may be empty 2. a double-quoted value, which may be empty

[\s/]+ - matches white space or '/' at least once

[^>]* - this will match all the character till '>' closing bracket
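A minimal sketch of this pattern applied to the five lines from the question. The pattern matches the well-formed lines, so the lines it does not match are the ones to report. The names firstAttrOk, samples and flagged are assumptions for illustration.

```javascript
// Pattern from this answer (the "/" inside the class is escaped for the regex literal).
const firstAttrOk = /<\w*[^=]*=["'][\w;:]*["'][\s\/]+[^>]*>/;

const samples = [
  '<div style="margin:37px;"/></div>',
  "<span title=''style=\"margin:37px;\" /></span>",
  '<span title="" style="margin:37px;" /></span>',
  '<a title="u" hghghgh  title="j" >',
  '<a title=""gg  ff>',
];

// Lines the pattern does NOT match are the ones with a missing space.
const flagged = samples.filter((line) => !firstAttrOk.test(line));
console.log(flagged);

// Limitation: only the first attribute value is inspected.
console.log(firstAttrOk.test('<a x="1" y="2"z="3">')); // true, although y and z are not separated
```

Because everything after the first value is consumed by [^>]*, a missing space later in the tag goes unnoticed.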

Upvotes: 1

Totem

Reputation: 7369

I got this pattern to work, finding incorrect lines 2 and 5 as you requested:

>>> import re
>>> p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'

>>> html = """
 <div style="margin:37px;"/></div>
 <span title=''style="margin:37px;" /></span>
 <span title="" style="margin:37px;" /></span>
 <a title="u" hghghgh  title="j" >

 <a title=""gg  ff>
"""

>>> bad = re.findall(p, html)
>>> print('\n'.join(bad))
<span title=''style="margin:37px;" /></span>
<a title=""gg  ff>

regex broken down:

p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'

< - starting bracket

[a-z]+\s - 1 or more lowercase letters followed by a space

[a-z]+= - 1 or more lowercase letters followed by an equals sign

[\'\"] - match a single or double quote one time

[\w;:]* - match an alphanumeric character (a-zA-Z0-9_), a colon or a semicolon 0 or more times

[\"\'] - again match a single or double quote one time

[\w]+ - match an alphanumeric character one or more times (this catches the missing space you wanted to detect)

.* - match anything 0 or more times (gets the rest of the line)
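For readers working in JavaScript, as in the other answers, here is a rough port of the same pattern. It is a sketch only: the names badLine, html and bad are chosen for illustration, and the g flag stands in for re.findall.

```javascript
// JavaScript port of the Python pattern above; the g flag makes match() return all hits.
const badLine = /<[a-z]+\s[a-z]+=['"][\w;:]*['"]\w+.*/g;

const html = [
  '<div style="margin:37px;"/></div>',
  "<span title=''style=\"margin:37px;\" /></span>",
  '<span title="" style="margin:37px;" /></span>',
  '<a title="u" hghghgh  title="j" >',
  '<a title=""gg  ff>',
].join('\n');

// match() returns null when nothing is found, hence the fallback to an empty array.
const bad = html.match(badLine) || [];
console.log(bad.join('\n'));
```

As in the Python version, the dot does not match a newline by default, so each hit stops at the end of its line.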

Upvotes: 1
