Reputation: 407
What is the proper Regex construction (.NET flavor) to extract the attribute/value pairs from an HTML style string, while ignoring HTML entities?
margin-top:0pt;margin:0;color:#000000;margin-left:0;font-size:26pt;margin-bottom:3pt;line-height:1.15;page-break-after:avoid;font-family:&quot;Arial&quot;;orphans:2;widows:2;text-align:left;margin-right:0
Splitting on ; and then on : would be simplest, but HTML entities contain semicolons, so this breaks on some strings. For example, entities can appear in the font-family style attribute:
font-family:&quot;Arial&quot;;
The style string is isolated (no style="), and single-line.
Ultimately I'll be regex-grouping them in this arrangement:
match:(
    group:( style-attribute-name )
    group:( style-attribute-value )
)
I'll then iterate through the matches to build a dictionary, with duplicate keys replacing earlier entries.
My current regex looks like this:
\s*(?<attr>[^:\s]*)\s*:\s*(?<val>[^;]*)[;]\s*
It produces mismatches when it hits HTML entities.
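To make the failure concrete, here is a minimal Python sketch that runs the pattern above over a short sample. Two things are assumptions rather than part of the question: the sample string (a shortened style string with the quotes encoded as &quot;), and the switch from .NET's (?<name>...) to Python's (?P<name>...) named-group syntax. The helper name parse_original is hypothetical.

```python
import re

# The pattern from the question, with .NET's (?<name>...) rewritten as
# Python's (?P<name>...). Nothing else is changed.
ORIGINAL = re.compile(r"\s*(?P<attr>[^:\s]*)\s*:\s*(?P<val>[^;]*)[;]\s*")


def parse_original(style):
    """Collect attr/val pairs into a dict; later duplicates overwrite earlier ones."""
    pairs = {}
    for m in ORIGINAL.finditer(style):
        pairs[m.group("attr")] = m.group("val")
    return pairs


# Hypothetical sample: quotes around the font name encoded as &quot; entities.
sample = "color:#000000;font-family:&quot;Arial&quot;;orphans:2;"
result = parse_original(sample)
print(result)
```

On this sample the value group stops at the semicolon that ends the first &quot;, so font-family gets the value &quot and the rest of the entity is absorbed into the next attribute name (Arial&quot;;orphans).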
Upvotes: 0
Views: 746
Reputation: 1211
I updated your regex, using balancing groups to skip a ; when it is preceded by an &.
Here is the regex:
(?<attr>[^:\s]*)\s*:\s*(?<val>(?:[^;&]*(?<html>&)?[^;&]*(?(html);(?<-html>)))+)(?:;|$)
Note: I have mostly replaced [^;]* with (?:[^;&]*(?<html>&)?[^;&]*(?(html);(?<-html>)))+ in the val group of your regex.
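Balancing groups are specific to .NET, so they cannot be shown in most other engines. As an illustration of the same idea (do not let an entity's semicolon end the value), here is a rough Python sketch that uses a different technique: plain alternation that consumes a whole entity (&name;, &#123; or &#x1F;) before falling back to any non-semicolon character. This is not the regex from this answer. The pattern, the function name parse_style, the decode_entities flag and the sample string are all assumptions for illustration. Python needs (?P<name>...) where .NET uses (?<name>...).

```python
import html
import re

# Value = any run of complete entities or single non-semicolon characters.
# The entity branch is tried first, so its trailing ';' is consumed as part
# of the value instead of being treated as a declaration separator.
STYLE_PAIR = re.compile(
    r"\s*(?P<attr>[^:;\s]+)\s*:\s*"
    r"(?P<val>(?:&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);|[^;])*)"
    r"(?:;|$)"
)


def parse_style(style, decode_entities=False):
    """Return a dict of style declarations; later duplicates overwrite earlier ones."""
    pairs = {}
    for m in STYLE_PAIR.finditer(style):
        value = m.group("val").strip()
        if decode_entities:
            # Optional: turn &quot; into a literal quote, and so on.
            value = html.unescape(value)
        pairs[m.group("attr")] = value
    return pairs


# Hypothetical sample with an entity-encoded font name and a duplicate key.
sample = "margin-top:0pt;margin:0;font-family:&quot;Arial&quot;;orphans:2;margin:1;text-align:left"
raw = parse_style(sample)
decoded = parse_style(sample, decode_entities=True)
print(raw)
print(decoded)
```

Because the pattern only relies on named groups and alternation, it should carry over to .NET by changing the group syntax back to (?<attr>...) and (?<val>...), though that port is untested here.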
Upvotes: 1
Reputation: 46
http://www.regextester.com https://www.mikesdotnetting.com/article/46/c-regular-expressions-cheat-sheet
These helped me when I was messing around with regex in school. I'm not near my computer right now, so I can't easily write it out for you :/
Hope it helped!
Upvotes: 0