Reputation: 107498
This is a follow-up to another question of mine. The solution I found there worked great for every test case I threw at it, until a case showed up that I had missed the first time around.
My goal is to reformat improperly formatted tag attributes using regex (I know, probably not a fool-proof method as I'm finding out, but bear with me).
My functions:
Public Function ConvertMarkupAttributeQuoteType(ByVal html As String) As String
    Dim findTags As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>"
    Return Regex.Replace(html, findTags, AddressOf EvaluateTag)
End Function
Private Function EvaluateTag(ByVal match As Match) As String
    Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
    Return Regex.Replace(match.Value, attributes, "='$2'")
End Function
The regex in the EvaluateTag function will correctly transform HTML like
<table border=2 cellpadding='2' cellspacing="1">
into
<table border='2' cellpadding='2' cellspacing='1'>
You'll notice I'm forcing attribute values to be surrounded by single quotes -- don't worry about that. The case it breaks on is when the last attribute value in the tag is unquoted.
<table width=100 border=0>
comes out of the regex replace as
<table width='100' border='0>'
with the last single quote incorrectly outside of the tag. I've confessed before that I'm not good at regex at all; I just haven't taken the time to understand everything it can do. So, I'm asking for some help adjusting the EvaluateTag regex so that it can handle this final case.
Thank you!
Upvotes: 0
Views: 796
Reputation: 107498
richardtallent's explanation of why the regex wasn't working pointed me in the right direction. After playing around a bit, the following replacement for the EvaluateTag function seems to be working.
Can anybody see anything problematic with it? The change I made is in the last group after the pipe: \S+ became [^>\s]+, so an unquoted value now stops at the closing greater-than sign. Maybe it could be simplified even further?
Private Function EvaluateTag(ByVal match As Match) As String
    Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"
    Return Regex.Replace(match.Value, attributes, "='$2'")
End Function
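For anyone who wants to check the revised pattern quickly, here is a minimal console sketch that runs it against the two sample tags from the question. The module name AttributeQuoteDemo and the helper RequoteAttributes are made-up names for illustration only. It substitutes ${g1} instead of $2; as far as I understand .NET's numbering (unnamed groups first, then named ones), both refer to the same group here.

```vb
Imports System
Imports System.Text.RegularExpressions

Module AttributeQuoteDemo
    ' Revised pattern: the unquoted branch stops at '>' or whitespace.
    Private Const AttributePattern As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"

    ' Hypothetical helper: re-quotes every attribute value in a single tag.
    Public Function RequoteAttributes(ByVal tag As String) As String
        ' Both alternatives capture into the group named g1.
        Return Regex.Replace(tag, AttributePattern, "='${g1}'")
    End Function

    Sub Main()
        ' Mixed quoting styles from the question.
        Console.WriteLine(RequoteAttributes("<table border=2 cellpadding='2' cellspacing=""1"">"))
        ' prints <table border='2' cellpadding='2' cellspacing='1'>

        ' The case that used to break: unquoted last attribute.
        Console.WriteLine(RequoteAttributes("<table width=100 border=0>"))
        ' prints <table width='100' border='0'>
    End Sub
End Module
```

This only exercises the cases mentioned in the question; it is not a substitute for a real HTML parser.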
If no one responds I'll probably accept this as the answer. Thanks again!
Upvotes: 1
Reputation: 35363
The first RegEx function will pass EvaluateTag the entire match, which is the entire HTML tag.
But EvaluateTag doesn't ignore the final greater-than character...
I'm afraid I haven't had enough caffeine yet to work through the entire expression, but this adjustment may work (added a greater-than in the character list):
Private Function EvaluateTag(ByVal match As Match) As String
    Dim attributes As String = "\s*=\s*(?:(['"">])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
    Return Regex.Replace(match.Value, attributes, "='$2'")
End Function
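The diagnosis at the top of this answer (the callback sees the whole tag, closing greater-than included, and \S+ is happy to consume it) can be checked with a small sketch. It applies the original pattern from the question and lists what the g1 group captures for each attribute. GreedyCaptureDemo and CapturedValues are hypothetical names used only for this illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module GreedyCaptureDemo
    ' Original attribute pattern from the question: the unquoted branch is \S+.
    Private Const OriginalPattern As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"

    ' Hypothetical helper: returns the g1 capture for every attribute found in the tag.
    Public Function CapturedValues(ByVal tag As String) As String()
        Dim matches As MatchCollection = Regex.Matches(tag, OriginalPattern)
        Dim values(matches.Count - 1) As String
        For i As Integer = 0 To matches.Count - 1
            values(i) = matches(i).Groups("g1").Value
        Next
        Return values
    End Function

    Sub Main()
        ' Brackets make the exact extent of each capture visible.
        For Each value As String In CapturedValues("<table width=100 border=0>")
            Console.WriteLine("[" & value & "]")
        Next
        ' prints [100]
        ' prints [0>]
    End Sub
End Module
```

The second capture includes the greater-than sign, which is why the replacement ends up placing the closing quote after it.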
Upvotes: 1