Damian Czapiewski

Reputation: 881

RegExp capturing HTML-like attributes data

Given a string of HTML-like attributes, e.g. 'attr="val" attr2="val2"', I'd like to extract the attribute names and values. It gets complicated because a value can contain spaces (so splitting on spaces won't work) and can contain both ' and " (the value itself may be wrapped in either ' or "). Finally, quotes can appear escaped with a backslash, i.e. \' or \". I've managed to capture almost everything except that last case: a value containing \" or \'.

The regex I have so far is here: https://regex101.com/r/Z7q73R/1
My goal is to turn the string 'attr="val" attr2="val\"2a\" val2b"' into the object {attr: 'val', attr2: 'val"2a" val2b'}.

Upvotes: 1

Views: 64

Answers (3)

user557597


You could also do it like this.

Readable regex

 ( \w+ )                       # (1), Attribute
 \s* 
 =                             # =
 \s* 
 ( ["'] )                      # (2), Value quote ', or "

 (                             # (3 start), Value
      [^"'\\]*                      # 0 to many not ",', or \ chars
      (?:                           # --------
           (?:                           # One of ...
                \\ [\S\s]                     # Escape + anything
             |                              # or,
                (?! \2 | \\ )                 # Not the value quote, nor escape
                [\S\s] 
           )                             # -----------
           [^"'\\]*                      # 0 to many not ",', or \ chars
      )*                            # Do 0 to many times
 )                             # (3 end)

 \2                            #  Value quote ', or "

var str = "attr1=\"\\'val\\'\\\"1\\\"\" attr2='val2a \\'hello\\' \\\"yo\\\" val2b'\n" +
"attr3=\"val\" attr4=\"val\\\"2a\\\" val2b\"\n";

console.log( str );

var re = /(\w+)\s*=\s*(["'])([^"'\\]*(?:(?:\\[\S\s]|(?!\2|\\)[\S\s])[^"'\\]*)*)\2/g;

var m; // declare the match variable so it does not leak as an implicit global
while ((m = re.exec(str)) !== null) {
    if (m.index === re.lastIndex)
        re.lastIndex++;

    var atr = m[1];
    var val = m[3];
    // Remove escapes if needed
    val = val.replace(/([^\\'"]|(?=\\["']))((?:\\\\)*)\\(["'])/g, "$1$2$3");

    console.log( atr + " => " + val );
}
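
JavaScript has no free-spacing (x) flag, so the readable form above can't be pasted into a regex literal as-is. One way to keep the comments, as a rough sketch only, is to join the pieces with String.raw and pass the result to new RegExp. The names attrPattern, attrRegex and text are made up for this illustration.

```javascript
// Assemble the commented pieces into one pattern string.
// String.raw keeps every backslash literal, so each piece reads like regex source.
const attrPattern = [
  String.raw`(\w+)`,                         // (1) attribute name
  String.raw`\s*=\s*`,                       // = with optional whitespace around it
  String.raw`(["'])`,                        // (2) opening quote, ' or "
  String.raw`(`,                             // (3) value start
  String.raw`[^"'\\]*`,                      //   run of chars that are not ", ' or \
  String.raw`(?:`,                           //   repeated group start
  String.raw`(?:\\[\S\s]|(?!\2|\\)[\S\s])`,  //   escape + any char, or a char that is
                                             //   neither the opening quote nor a backslash
  String.raw`[^"'\\]*`,                      //   another run of plain chars
  String.raw`)*`,                            //   repeated group end, 0 or more times
  String.raw`)`,                             // (3) value end
  String.raw`\2`,                            // closing quote, same as the opening one
].join("");

const attrRegex = new RegExp(attrPattern, "g");

// Sample input with escaped quotes inside both quote styles.
const text = String.raw`attr1="\'val\'\"1\"" attr2='val2a \'hello\' \"yo\" val2b'`;

for (const m of text.matchAll(attrRegex)) {
  // m[3] still contains the backslashes; unescaping is a separate step.
  console.log(m[1] + " => " + m[3]);
}
```

The joined string is meant to be the same pattern as the one-line literal; the only gain is that each piece carries its own comment.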

Upvotes: 0

Damian Czapiewski

Reputation: 881

Thanks to @revo, I now have working code. I'm posting it below for posterity.

const regex = /(\w+)=(?:"([^\\"]*(?:\\.[^\\"]*)*)"|'([^\\']*(?:\\.[^\\']*)*)')/gm;
const str = `attr1="\\'val\\'\\"1\\"" attr2='val2a \\'hello\\' \\"yo\\" val2b'`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
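    // Check for undefined rather than truthiness so an empty value (attr="") is not lost
    console.log(m[1] + ' => ' + ( m[2] !== undefined ? m[2] : m[3] ));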
}
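
To get from the logged pairs to the object asked for in the question, the matches still need to be collected and the backslashes removed. Here is a minimal sketch of that last step. parseAttributes is a hypothetical helper name, and it assumes a backslash simply escapes whatever character follows it.

```javascript
// Hypothetical helper (not from the thread): turn an attribute string into an object.
function parseAttributes(input) {
  // Same pattern as above: group 2 holds a double-quoted value, group 3 a single-quoted one.
  const regex = /(\w+)=(?:"([^\\"]*(?:\\.[^\\"]*)*)"|'([^\\']*(?:\\.[^\\']*)*)')/g;
  const result = {};
  for (const m of input.matchAll(regex)) {
    // Compare against undefined so an empty value such as attr="" is kept.
    const raw = m[2] !== undefined ? m[2] : m[3];
    // Assumption: a backslash escapes the next character, so \" becomes " and \\ becomes \.
    result[m[1]] = raw.replace(/\\(.)/g, "$1");
  }
  return result;
}

// Source text: attr="val" attr2="val\"2a\" val2b"
const parsed = parseAttributes('attr="val" attr2="val\\"2a\\" val2b"');
console.log(parsed);
```

If the same attribute name appears twice, the later value overwrites the earlier one; whether that is the right behaviour depends on the use case.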

Upvotes: 0

revo

Reputation: 48761

If we assume all attribute values are enclosed in double quotes, names consist of word characters ([a-zA-Z0-9_]), and attributes are separated by at least one space character, then the regex below matches as expected:

(\w+)="([^\\"]*(?:\\.[^\\"]*)*)"

Breaking down the [^\\"]*(?:\\.[^\\"]*)* chunk:

  • [^\\"]* Match anything except a backslash or "
  • (?: Start of non-capturing group
    • \\. Match an escaped character (a backslash followed by any character)
    • [^\\"]* Match anything except a backslash or "
  • )* End of non-capturing group, repeated as many times as possible

JS code:

var str = `'attr="val" attr2="val2"'`;
var re = /(\w+)="([^\\"]*(?:\\.[^\\"]*)*)"/g;

var m; // declare the match variable so it does not leak as an implicit global
while ((m = re.exec(str)) !== null) {
    if (m.index === re.lastIndex)
        re.lastIndex++;
    console.log(m[1] + " => " + m[2])
}
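
The sample string above has no escaped quotes in it, so here is a quick sketch that runs the same pattern on the input from the question and compares it with a naive "([^"]*)" pattern. collectPairs, sample, naive and unrolled are names I made up for the comparison.

```javascript
// Hypothetical helper (not from the thread): collect [name, value] pairs for a global regex.
function collectPairs(regex, input) {
  const pairs = [];
  for (const m of input.matchAll(regex)) {
    pairs.push([m[1], m[2]]);
  }
  return pairs;
}

// Source text: attr="val" attr2="val\"2a\" val2b"
const sample = 'attr="val" attr2="val\\"2a\\" val2b"';

// Naive pattern: stops at the first double quote, even an escaped one.
const naive = /(\w+)="([^"]*)"/g;
// Pattern from this answer: treats backslash + any character as one unit.
const unrolled = /(\w+)="([^\\"]*(?:\\.[^\\"]*)*)"/g;

console.log(collectPairs(naive, sample));
// attr2 is cut short: its captured value is  val\
console.log(collectPairs(unrolled, sample));
// attr2 is captured whole, backslashes included:  val\"2a\" val2b
```

Note that the captured value still contains the backslashes, and single-quoted values are not matched at all under the assumptions stated above.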

Upvotes: 1
