Reputation: 187
I'm new to regular expressions and am trying to filter HTML tags so that only the required attributes (src, href, style) and their values are kept and all other attributes are removed. While googling I found a regular expression that keeps only the "src" attribute, and I modified it as follows:
<([a-z][a-z0-9]*)(?:[^>]*(\s(src|href|style)=['\"][^'\"]*['\"]))?[^>]*?(\/?)>
It works fine, except that if a tag contains more than one required attribute, only the last matched attribute is kept and the rest are discarded.
I'm trying to clean the following text
<title>Hello World</title>
<div fadeout"="" style="margin:0px;" class="xyz">
<img src="abc.jpg" alt="" />
<p style="margin-bottom:10px;">
The event is celebrating its 50th anniversary Kö
<a style="margin:0px;" href="http://www.germany.travel/">exhibition grounds in Cologne</a>.
</p>
<p style="padding:0px;"></p>
<p style="color:black;">
<strong>A festival for art lovers</strong>
</p>
</div>
at https://regex101.com/#javascript, using the expression above with <$1$2$4> as the substitution string, and I get the following output:
<title>Hello World</title>
<div style="margin:0px;">
<img src="abc.jpg"/>
<p style="margin-bottom:10px;">
The event is celebrating its 50th anniversary Kö
<a href="http://www.germany.travel/">exhibition grounds in Cologne</a>.
</p>
<p style="padding:0px;"></p>
<p style="color:black;">
<strong>A festival for art lovers</strong>
</p>
</div>
The problem is that the "style" attribute is discarded from the anchor tag.
I have tried repeating the (\s(src|href|style)=['\"][^'\"]*['\"]) block using the * operator, the {3} quantifier, and more, but in vain.
Any suggestions?
Upvotes: 3
Views: 3903
Reputation: 784
Here you go, based on your original regex:
<([a-z][a-z0-9]*?)(?:[^>]*?((?:\s(?:src|href|style)=['\"][^'\"]*['\"]){0,3}))[^>]*?(\/?)>
Group 1 is the tag name, group 2 is the attributes, and group 3 is the / if there is one. I couldn't get it to work when non-allowed attributes are interleaved with allowed ones, e.g. <a href="foo" class="bar" src="baz" />. I don't think it can be done with a single regex.
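For context on why the attribute block is wrapped in an outer group here: in JavaScript, a capture group that sits inside a quantifier keeps only its last iteration, whereas a group wrapped around the whole repetition captures every iteration at once. A minimal sketch of that difference; the sample string and the variable names are made up for illustration:

```javascript
// Sample input, made up for illustration.
var attrs = ' style="a" href="b"';

// The capture group is INSIDE the repetition: only the last iteration is kept.
var inner = /^(?:(\s(?:src|href|style)="[^"]*"))+$/.exec(attrs);

// The capture group WRAPS the repetition: every iteration is kept.
var outer = /^((?:\s(?:src|href|style)="[^"]*")+)$/.exec(attrs);

console.log(inner[1]); // ' href="b"'
console.log(outer[1]); // ' style="a" href="b"'
```

This is the same effect the question describes: the original expression captures each attribute with a group inside the repetition, so earlier matches are overwritten.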
Edit: per @AhmadAhsan's corrections below, the regex should be:
var html = `<div fadeout"="" style="margin:0px;" class="xyz">
<img src="abc.jpg" alt="" />
<p style="margin-bottom:10px;">
The event is celebrating its 50th anniversary Kö
<a style="margin:0px;" href="http://www.germany.travel/">exhibition grounds in Cologne</a>.
</p>
<p style="padding:0px;"></p>
<p style="color:black;">
<strong>A festival for art lovers</strong>
</p>
</div>`
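// The g flag applies the replacement to every tag; $1 is the tag name,
// $2 the kept attributes, and $3 the optional "/".
console.log(
  html.replace(/<([a-z][a-z0-9]*)(?:[^>]*?((?:\s(?:src|href|style)=['\"][^'\"]*['\"]){0,3}))[^>]*?(\/?)>/g, '<$1$2$3>')
)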
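The interleaved case mentioned above can be handled without a single all-in-one pattern: match each opening tag, then pick the allowed attributes out of it in a replacement callback. This is only a sketch under assumptions: the function name `keepAllowedAttributes` is hypothetical, attribute values are assumed to be quoted, and a `>` inside an attribute value would break it, so a real HTML parser remains the safer choice.

```javascript
// Hypothetical helper: keeps only src / href / style on every opening tag.
function keepAllowedAttributes(html) {
  // Pass 1: match each opening tag (name, raw attribute text, optional "/").
  return html.replace(/<([a-z][a-z0-9]*)\b([^>]*?)(\/?)>/gi, function (match, name, rawAttrs, slash) {
    // Pass 2: collect every allowed, quoted attribute from the raw attribute text.
    var kept = rawAttrs.match(/\s(?:src|href|style)=(?:"[^"]*"|'[^']*')/gi) || [];
    return '<' + name + kept.join('') + slash + '>';
  });
}

console.log(keepAllowedAttributes('<a href="foo" class="bar" src="baz" />'));
// <a href="foo" src="baz"/>
```

Closing tags such as </p> are left untouched because the pattern requires a letter directly after the <.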
Upvotes: 1
Reputation: 74
@AhmadAhsan, here is a demo that fixes your issue using DOM manipulation: https://jsfiddle.net/pu1hsdgn/
<script src="https://code.jquery.com/jquery-1.9.1.js"></script>
<script>
// Attributes that are allowed to stay on every element.
var whitelist = ["src", "href", "style"];
$( document ).ready(function() {
  function foo(contents) {
    // Parse the markup into a detached container so it can be walked as DOM.
    var temp = $(document.createElement('div')).html(contents);
    temp.find('*').each(function () {
      var attributes = this.attributes;
      var i = attributes.length;
      // Iterate backwards because removing an attribute shrinks the list.
      while ( i-- ) {
        var attr = attributes[i];
        if ( $.inArray(attr.name, whitelist) == -1 )
          this.removeAttributeNode(attr);
      }
    });
    return temp.html();
  }
  var raw = '<title>Hello World</title><div style="margin:0px;" fadeout"="" class="xyz"><img src="abc.jpg" alt="" /><p style="margin-bottom:10px;">The event is celebrating its 50th anniversary Kö <a href="http://www.germany.travel/" style="margin:0px;">exhibition grounds in Cologne</a>.</p><p style="padding:0px;"></p><p style="color:black;"><strong>A festival for art lovers</strong></p></div>'
  alert(foo(raw));
});
</script>
Upvotes: 5