Reputation: 179
How can I detect a missing space between attributes? Example:
<div style="margin:37px;"/></div>
<span title=''style="margin:37px;" /></span>
<span title="" style="margin:37px;" /></span>
<a title="u" hghghgh title="j" >
<a title=""gg ff>
Correct: lines 1, 3, 4
Incorrect: lines 2, 5
How can I detect the incorrect ones?
I've tried with this:
<(.*?=(['"]).*?\2)([\S].*)|(^/)>
But it's not working.
Upvotes: 5
Views: 524
Reputation: 980
You should not use regex to parse HTML, unless it is for learning purposes.
<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?>
This regular expression matches a tag even if it has no attributes at all. It also handles self-closing tags and attributes that have no value.
<\w+
- matches the opening < followed by one or more \w characters
(\s+[\w-]+(=(['"]?)[^"']*\3)?)*
- zero or more attributes, each of which must start with white space. It contains two parts:
\s+[\w-]+
- the attribute name after mandatory white space
(=(['"]?)[^"']*\3)?
- the optional attribute value
\s*/?>
- optional white space and an optional /, followed by the closing >
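The same breakdown can also be read off the pattern itself when it is written with comments. Below is a minimal sketch using Python's re.VERBOSE flag. Python and the name TAG are assumptions made for illustration (the test in this answer uses JavaScript); the pattern is the one from the top of this answer, including the \3 backreference.

```python
import re

# The answer's pattern, spread out with re.VERBOSE so each part carries a comment.
# TAG is a hypothetical name chosen for this sketch.
TAG = re.compile(r"""
    <\w+                  # opening < followed by the tag name
    (                     # group 1: one attribute, repeated zero or more times
        \s+[\w-]+         #   mandatory white space, then the attribute name
        (                 #   group 2: the optional value
            =(['"]?)      #     =, then an optional quote captured as group 3
            [^"']*        #     value characters, none of which is a quote
            \3            #     the same quote that opened the value (or nothing)
        )?
    )*
    \s*/?>                # optional white space, optional /, then closing >
""", re.VERBOSE)

samples = [
    '<span title="" style="margin:37px;" />',    # space between attributes
    "<span title=''style=\"margin:37px;\" />",   # no space after title=''
    '<a title=""gg ff>',                         # no space after title=""
]

for tag in samples:
    # fullmatch requires the whole opening tag to fit the pattern.
    print(bool(TAG.fullmatch(tag)), tag)
```

Only the first sample should print True; the other two fail because nothing in the pattern allows an attribute name to start directly after a closing quote.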
Here is a test for the strings:
var re = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;
! '<div style="margin:37px;"/></div>'.match(re);
false
! '<span title=\'\'style="margin:37px;" /></span>'.match(re);
true
! '<span title="" style="margin:37px;" /></span>'.match(re);
false
! '<a title="u" hghghgh title="j" >'.match(re);
false
! '<a title=""gg ff>'.match(re);
true
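// Sample markup mixing well-formed and malformed opening tags.
var html = '<div style="margin:37px;"></div> <span title=\'\'style="margin:37px;"/><a title=""gg ff/> <span title="" style="margin:37px;" /></span> <a title="u" hghghgh title="j"example> <a title=""gg ff>';
// Loose pattern: finds every opening tag, valid or not.
var tagRegex = /<\w+[^>]*\/?>/g;
// Strict pattern: attributes must be separated by white space.
var validRegex = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;
html.match(tagRegex).forEach(function(m) {
  // Report each tag that the strict pattern rejects.
  if (!m.match(validRegex)) {
    console.log('Incorrect', m);
  }
});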
This will output:
Incorrect <span title=''style="margin:37px;"/>
Incorrect <a title=""gg ff/>
Incorrect <a title="u" hghghgh title="j"example>
Incorrect <a title=""gg ff>
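A stricter variant that treats double-quoted, single-quoted, and unquoted values as separate alternatives:
<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>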
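To see how this last pattern behaves, here is a minimal Python sketch of the same two-step idea as the JavaScript test: find every opening tag with a loose pattern, then report the ones the strict pattern rejects. The names VALID_TAG, ANY_TAG and find_malformed_tags are hypothetical, and the sample markup is the html string from the test.

```python
import re

# Strict pattern from the answer: a value is double-quoted, single-quoted,
# or an unquoted run of word characters and hyphens.
VALID_TAG = re.compile(r"""<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>""")
# Loose pattern that only locates opening tags (closing tags are skipped).
ANY_TAG = re.compile(r"<\w+[^>]*>")


def find_malformed_tags(html):
    """Return the opening tags in html that the strict pattern rejects."""
    return [tag for tag in ANY_TAG.findall(html) if not VALID_TAG.fullmatch(tag)]


html = (
    '<div style="margin:37px;"></div> '
    "<span title=''style=\"margin:37px;\"/>"
    '<a title=""gg ff/> '
    '<span title="" style="margin:37px;" /></span> '
    '<a title="u" hghghgh title="j"example> '
    '<a title=""gg ff>'
)

for tag in find_malformed_tags(html):
    print("Incorrect", tag)
```

On this sample it should report the same four tags as the JavaScript version. Keeping the three value forms as separate alternatives means a double-quoted value may contain a single quote and vice versa, which the earlier pattern does not allow.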
Upvotes: 3
Reputation: 9654
I'm not sure about this, as I'm not very experienced with regex, but this looks like it works well:
<([a-z]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>([^<]+(<\/\1>))?
Currently the tag name is matched with <([a-z]+), which will mostly work, but for web components and <ng-* tags it would be better to use [\w-]+ (\w alone does not match the hyphen).
Output:
<div style="margin:37px;">div</div> correct
<span title=" style="margin:37px;" />span1</span> incorrect
<span title="" style="margin:37px;" />span2</span> correct
<a title="u" title="j">link</a> correct
<a title=""href="" alt="" required>test</a> incorrect
<img src="" data-abc="" required> correct
<input type=""style="" /> incorrect
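The tag-name caveat can be checked directly. The following is a rough Python sketch, not the pattern from this answer: it widens the tag name to [\w-]+ so that hyphenated names are accepted, and it uses the backreference \1 to require a closing tag with the same name. PATTERN and is_well_formed are hypothetical names.

```python
import re

# Variant of the answer's pattern (an assumption for this sketch):
# tag name widened to [\w-]+, closing tag tied to it with \1.
PATTERN = re.compile(r'<([\w-]+)(\s+[\w-]+(="[^"]*")?)*\s*/?>([^<]+(</\1>))?')


def is_well_formed(fragment):
    """True if the whole fragment fits the pattern."""
    return PATTERN.fullmatch(fragment) is not None


# \w does not include the hyphen, so a plain \w+ tag name stops at "my".
print(re.fullmatch(r"<\w+>", "<my-widget>"))        # None
print(re.fullmatch(r"<[\w-]+>", "<my-widget>") is not None)

for fragment in [
    '<div style="margin:37px;">div</div>',
    '<a title=""href="" alt="" required>test</a>',
    '<my-widget data-x="1">hi</my-widget>',
    "<div>text</span>",
]:
    print(is_well_formed(fragment), fragment)
```

The second and fourth fragments should come out as False: one has no space before href, the other closes with a different tag name. Like the original, this sketch only accepts double-quoted values.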
Upvotes: 1
Reputation: 41
Try this regex; I think it will work:
<\w*[^=]*=["'][\w;:]*["'][\s/]+[^>]*>
<
- starting bracket
\w*
- zero or more word characters (the tag name)
[^=]*=
- covers all the characters up to and including the first '='
["'][\w;:]*["']
- this will match two cases:
1. a single-quoted value, which may be empty
2. a double-quoted value, which may be empty
[\s/]+
- matches a space or '/', at least one occurrence
[^>]*
- matches all the characters up to the closing bracket '>'
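Note that this pattern matches the tags that are spaced correctly, so the incorrect ones are those it fails to match. Here is a small Python sketch that runs it over the five lines from the question; PATTERN, lines and flagged are hypothetical names, and Python is an assumption since the answer names no language.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(r"""<\w*[^=]*=["'][\w;:]*["'][\s/]+[^>]*>""")

lines = [
    '<div style="margin:37px;"/></div>',
    "<span title=''style=\"margin:37px;\" /></span>",
    '<span title="" style="margin:37px;" /></span>',
    '<a title="u" hghghgh title="j" >',
    '<a title=""gg ff>',
]

# A line is flagged when the pattern finds no correctly spaced tag in it.
flagged = [n for n, line in enumerate(lines, start=1) if not PATTERN.search(line)]
print(flagged)  # [2, 5]

# Caveat: [\s/]+ needs at least one space or slash after the first value,
# so a valid tag such as <a title="x"> is not matched either.
print(PATTERN.search('<a title="x">'))  # None
```

Only the first attribute is inspected, and values containing characters outside [\w;:] (a URL, for example) will also fail to match, so treat a non-match as a hint rather than proof.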
Upvotes: 1
Reputation: 7369
I got this pattern to work, finding incorrect lines 2 and 5 as you requested:
>>> import re
>>> p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'
>>> html = """
<div style="margin:37px;"/></div>
<span title=''style="margin:37px;" /></span>
<span title="" style="margin:37px;" /></span>
<a title="u" hghghgh title="j" >
<a title=""gg ff>
"""
>>> bad = re.findall(p, html)
>>> print('\n'.join(bad))
<span title=''style="margin:37px;" /></span>
<a title=""gg ff>
The regex broken down:
p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'
<
- starting bracket
[a-z]+\s
- 1 or more lowercase letters followed by a whitespace character
[a-z]+=
- 1 or more lowercase letters followed by an equals sign
[\'\"]
- match a single or double quote one time
[\w;:]*
- match a word character (a-zA-Z0-9_), a colon, or a semicolon 0 or more times
[\"\']
- again match a single or double quote one time
[\w]+
- match a word character one or more times (this catches the missing space you wanted to detect)
.*
- match anything 0 or more times (gets the rest of the line)
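As the breakdown shows, the pattern only looks at the first attribute of a tag. A short Python 3 sketch makes that limit visible; FIRST_ATTR_GLUED and flagged_line_numbers are hypothetical names introduced here, while the pattern is the one above.

```python
import re

# Same pattern as above: a word character glued to the first attribute's closing quote.
FIRST_ATTR_GLUED = re.compile(r"""<[a-z]+\s[a-z]+=['"][\w;:]*["'][\w]+.*""")


def flagged_line_numbers(lines):
    """Return 1-based numbers of the lines in which the pattern finds a match."""
    return [n for n, line in enumerate(lines, start=1) if FIRST_ATTR_GLUED.search(line)]


question_lines = [
    '<div style="margin:37px;"/></div>',
    "<span title=''style=\"margin:37px;\" /></span>",
    '<span title="" style="margin:37px;" /></span>',
    '<a title="u" hghghgh title="j" >',
    '<a title=""gg ff>',
]
print(flagged_line_numbers(question_lines))  # [2, 5]

# The missing space sits after the second attribute, so it goes unnoticed.
print(flagged_line_numbers(['<a href="x" title="y"z>']))  # []
```

If later attributes matter too, each tag would have to be checked attribute by attribute, which this pattern does not attempt.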
Upvotes: 1