Reputation: 881
Given a string that represents HTML-like attributes, e.g. 'attr="val" attr2="val2"', I'd like to extract the attribute names and values. It gets complicated because a value can contain spaces (so splitting on spaces won't work), and it can contain both ' and " (note that the value itself can be surrounded by either ' or "). Finally, quotes can appear preceded by a backslash, i.e. \' or \". I managed to capture almost everything except the last case: a value containing \" or \'.
The regex I have so far is here: https://regex101.com/r/Z7q73R/1
My goal is to turn the string 'attr="val" attr2="val\"2a\" val2b"' into the object {attr: 'val', attr2: 'val"2a" val2b'}.
Upvotes: 1
Views: 64
Reputation:
You could also do it like this.
Readable regex
( \w+ ) # (1), Attribute
\s*
= # =
\s*
( ["'] ) # (2), Value quote ', or "
( # (3 start), Value
[^"'\\]* # 0 to many not ",', or \ chars
(?: # --------
(?: # One of ...
\\ [\S\s] # Escape + anything
| # or,
(?! \2 | \\ ) # Not the value quote, nor escape
[\S\s]
) # -----------
[^"'\\]* # 0 to many not ",', or \ chars
)* # Do 0 to many times
) # (3 end)
\2 # Value quote ', or "
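JavaScript has no free-spacing mode for regex literals, so the commented layout above cannot be pasted into code as-is. One way to keep the comments is to assemble the pattern from raw string pieces. This is only a sketch: the names parts and assembled are made up for illustration, and it assumes a runtime that accepts \2 inside a String.raw template (ES2018 or later).

```javascript
// Hypothetical names (parts, assembled); the pieces mirror the readable regex above.
const parts = [
  String.raw`(\w+)`,                          // (1) attribute name
  String.raw`\s*=\s*`,                        // equals sign, optional whitespace around it
  String.raw`(["'])`,                         // (2) opening value quote, ' or "
  String.raw`(`,                              // (3 start) value
  String.raw`[^"'\\]*`,                       // 0 to many chars that are not ", ' or \
  String.raw`(?:`,                            // start of repeated group
  String.raw`(?:\\[\S\s]|(?!\2|\\)[\S\s])`,   // escape + anything, or a char that is neither the value quote nor \
  String.raw`[^"'\\]*`,                       // 0 to many chars that are not ", ' or \
  String.raw`)*`,                             // repeat the group 0 to many times
  String.raw`)`,                              // (3 end)
  String.raw`\2`,                             // closing quote, same as the opening one
];

const assembled = new RegExp(parts.join(''), 'g');

// Quick check on a single-quoted value that contains escaped and unescaped quotes
const sample = String.raw`attr2='val2a \'hello\' "yo" val2b'`;
const firstMatch = assembled.exec(sample);
console.log(firstMatch[1] + ' => ' + firstMatch[3]);
```

The assembled pattern should have the same source as the one-line regex literal used in the code that follows.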
var str = "attr1=\"\\'val\\'\\\"1\\\"\" attr2='val2a \\'hello\\' \\\"yo\\\" val2b'\n" +
"attr3=\"val\" attr4=\"val\\\"2a\\\" val2b\"\n";
console.log( str );
var re = /(\w+)\s*=\s*(["'])([^"'\\]*(?:(?:\\[\S\s]|(?!\2|\\)[\S\s])[^"'\\]*)*)\2/g;
var m; // declare m so it does not leak as an implicit global
while ((m = re.exec(str)) !== null) {
if (m.index === re.lastIndex)
re.lastIndex++;
var atr = m[1];
var val = m[3];
// Remove escapes if needed
val = val.replace(/([^\\'"]|(?=\\["']))((?:\\\\)*)\\(["'])/g, "$1$2$3");
console.log( atr + " => " + val );
}
Upvotes: 0
Reputation: 881
Thanks to @revo, I've got working code. I'm posting it below for posterity.
const regex = /(\w+)=(?:"([^\\"]*(?:\\.[^\\"]*)*)"|'([^\\']*(?:\\.[^\\']*)*)')/gm;
const str = `attr1="\\'val\\'\\"1\\"" attr2='val2a \\'hello\\' \\"yo\\" val2b'`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// Check against undefined so an empty double-quoted value ("") is not skipped
console.log(m[1] + ' => ' + (m[2] !== undefined ? m[2] : m[3]));
}
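The question asked for an object rather than console output, so here is a minimal sketch that collects the matches into one. The function name parseAttributes is made up for illustration, and the unescape step assumes a backslash simply escapes whatever character follows it (so \" becomes " and \\ becomes \); adjust that if your input uses different escaping rules.

```javascript
// Hypothetical helper, not from the thread: collects name/value pairs into an object.
function parseAttributes(input) {
  // Same pattern as above: a name, then a double-quoted or single-quoted value
  const regex = /(\w+)=(?:"([^\\"]*(?:\\.[^\\"]*)*)"|'([^\\']*(?:\\.[^\\']*)*)')/g;
  const result = {};
  let m;
  while ((m = regex.exec(input)) !== null) {
    // Group 2 holds a double-quoted value, group 3 a single-quoted one
    const raw = m[2] !== undefined ? m[2] : m[3];
    // Assumption: a backslash escapes the next character, whatever it is
    result[m[1]] = raw.replace(/\\([\s\S])/g, '$1');
  }
  return result;
}

// String.raw keeps the backslashes exactly as typed
const parsed = parseAttributes(String.raw`attr="val" attr2="val\"2a\" val2b"`);
console.log(parsed);
```

If the same attribute name appears twice, the later value overwrites the earlier one in this sketch.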
Upvotes: 0
Reputation: 48761
If we assume that all attribute values are enclosed in double quotes, that names consist of word characters ([a-zA-Z0-9_]), and that attributes are separated by at least one space character, then the regex below matches as expected:
(\w+)="([^\\"]*(?:\\.[^\\"]*)*)"
Breaking down the [^\\"]*(?:\\.[^\\"]*)* chunk:
[^\\"]*
Match anything except a backslash or "
(?:
Start of non-capturing group
\\.
Match an escaped character
[^\\"]*
Match anything except a backslash or "
)*
End of non-capturing group, repeated as many times as possible
JS code:
var str = `'attr="val" attr2="val2"'`;
var re = /(\w+)="([^\\"]*(?:\\.[^\\"]*)*)"/g;
var m; // declare m so it does not leak as an implicit global
while ((m = re.exec(str)) !== null) {
if (m.index === re.lastIndex)
re.lastIndex++;
console.log(m[1] + " => " + m[2])
}
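To see what the (?:\\.[^\\"]*)* part buys you, here is a small comparison sketch against a naive pattern that stops at the first double quote. The names naive, unrolled, naiveValues and unrolledValues are made up for illustration, and String.prototype.matchAll is assumed to be available (Node 12+ or a modern browser).

```javascript
// Input from the question, with backslashes kept literally via String.raw
const input = String.raw`attr="val" attr2="val\"2a\" val2b"`;

// Naive pattern: the value ends at the first ", escaped or not
const naive = /(\w+)="([^"]*)"/g;
// Pattern from this answer: a backslash always consumes the character after it
const unrolled = /(\w+)="([^\\"]*(?:\\.[^\\"]*)*)"/g;

const naiveValues = Array.from(input.matchAll(naive), (m) => m[2]);
const unrolledValues = Array.from(input.matchAll(unrolled), (m) => m[2]);

console.log(naiveValues);    // second value is cut off at the first escaped quote
console.log(unrolledValues); // second value keeps the whole val\"2a\" val2b text
```

Note that the captured value still contains the backslashes; removing them is a separate step.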
Upvotes: 1