Memetican

Reputation: 407

How to Parse an HTML STYLE Attribute with Regex?

What is the proper Regex construction (.NET flavor) to extract the attribute/value pairs from an HTML style string, while ignoring HTML entities?

margin-top:0pt;margin:0;color:#000000;margin-left:0;font-size:26pt;margin-bottom:3pt;line-height:1.15;page-break-after:avoid;font-family:&quot;Arial&quot;;orphans:2;widows:2;text-align:left;margin-right:0

Splitting on ; and then on : would be simplest, but because HTML entities contain semicolons, this breaks on some strings. For example, entities can appear in the font-family value.

font-family:&quot;Arial&quot;;
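To make the failure concrete, here is a minimal Python sketch of the naive split (the question is about .NET, so this only illustrates the logic). The sample string is a hypothetical fragment with the quotes written as &quot; entities, which is an assumption about how the original input is encoded.

```python
# Hypothetical sample: a style fragment whose value contains HTML entities.
style = "font-family:&quot;Arial&quot;;orphans:2"

# Naive approach: split declarations on ';', then name/value on the first ':'.
pieces = style.split(";")
print(pieces)

# Drop empty pieces and split each remaining piece into name and value.
pairs = [p.split(":", 1) for p in pieces if p]
print(pairs)
```

The semicolons that terminate the entities are treated as declaration separators, so the font-family value is cut off after the first entity and a stray piece with no property name appears.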

The style string is isolated (no style="), and single-line.

Ultimately I'll be regex-grouping them in this arrangement:

match:( 
    group:( style-attribute-name ) 
    group:( style-attribute-value ) 
    )

I'll then iterate through the matches to build a dictionary (with duplicate keys being replaced).

My current regex looks like this:

\s*(?<attr>[^:\s]*)\s*:\s*(?<val>[^;]*)[;]\s*

It produces mismatches when it hits the HTML entities.

Upvotes: 0

Views: 746

Answers (2)

Gawil

Reputation: 1211

I updated your regex, using balancing groups so that a ; is skipped when it closes an entity opened by &.

Here is the regex:
(?<attr>[^:\s]*)\s*:\s*(?<val>(?:[^;&]*(?<html>&)?[^;&]*(?(html);(?<-html>)))+)(?:;|$)

Note: I have mainly replaced [^;]* with (?:[^;&]*(?<html>&)?[^;&]*(?(html);(?<-html>)))+ in the val group from your regex.
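For comparison, here is a sketch of a different technique that does not rely on balancing groups: a plain alternation that consumes a whole entity (& ... ;) as one unit before falling back to any non-semicolon character. It is written in Python rather than .NET, so the named groups use (?P<attr>...) where .NET would use (?<attr>...). The function name parse_style and the sample input are hypothetical, and the pattern assumes entities look like &name; or &#digits;.

```python
import re

# attr: one or more characters that are not ':', ';' or whitespace.
# val:  repeated units, each either a complete entity (&name; or &#123;)
#       or a single character that is not ';'.
STYLE_PAIR = re.compile(
    r"(?P<attr>[^:;\s]+)\s*:\s*(?P<val>(?:&#?[0-9A-Za-z]+;|[^;])*)"
)


def parse_style(style: str) -> dict[str, str]:
    """Return property -> value; a later duplicate replaces an earlier one."""
    result: dict[str, str] = {}
    for match in STYLE_PAIR.finditer(style):
        # Entities are left undecoded; only surrounding whitespace is trimmed.
        result[match.group("attr")] = match.group("val").strip()
    return result


sample = "margin-top:0pt;font-family:&quot;Arial&quot;;orphans:2;margin-top:3pt"
print(parse_style(sample))
```

Because the entity branch is tried first, a semicolon that ends an entity never terminates the value, and assigning into the dictionary gives the replace-on-duplicate behaviour asked for in the question.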

Upvotes: 1

Dominic Mazur

Reputation: 46

http://www.regextester.com
https://www.mikesdotnetting.com/article/46/c-regular-expressions-cheat-sheet

These helped me when I was screwing around with regex in school, not near my computer rn so I can't easily write it for ya :/

Hope it helped!

Upvotes: 0
