Reputation: 1044
For example, I have this image tag:
<img src="http://... .jpg" al="myImage" hhh="aaa" />
and, for a generic image tag, I maintain the list of all valid attributes:
L1=(alt, src, width, height, align, border, hspace, longdesc, vspace)
I parse the img tag and get the attributes it actually uses, like this:
L2=(src, al, hhh)
How can I programmatically validate the image tag, so that the 'al' attribute becomes 'alt' ('alt' is a closer match than 'align', which has many more extra characters) and the 'hhh' attribute disappears (because no valid attribute resembles it)?
The resulting tag should look like this:
<img src="http://... .jpg" alt="myImage" />
Thanks.
Jeff
Upvotes: 1
Views: 567
Reputation: 9124
You could use Linq2Xml to easily parse the code:
XElement doc = XElement.Parse(...)
Then correct the wrong attributes by running a best-match algorithm against an in-memory dictionary of valid attributes.
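As a rough illustration of that parse-then-correct idea, here is a minimal sketch in Python rather than .NET. It swaps in the standard library's `xml.etree.ElementTree` for Linq2Xml and `difflib.get_close_matches` for the best-match step. The function name `fix_img_tag` and the `cutoff` value of 0.6 are assumptions made for this example, not something from the question.

```python
import difflib
import xml.etree.ElementTree as ET

# The valid attribute list from the question.
VALID_ATTRIBUTES = ["alt", "src", "width", "height", "align",
                    "border", "hspace", "longdesc", "vspace"]


def fix_img_tag(markup, cutoff=0.6):
    """Parse a tag, rename near-miss attributes, and drop unknown ones.

    cutoff is an assumed similarity threshold (0..1); tune it to taste.
    """
    element = ET.fromstring(markup)
    fixed = {}
    for name, value in element.attrib.items():
        if name in VALID_ATTRIBUTES:
            # A correctly spelled attribute always wins.
            fixed[name] = value
            continue
        # Ask difflib for the single closest valid name above the cutoff.
        candidates = difflib.get_close_matches(
            name, VALID_ATTRIBUTES, n=1, cutoff=cutoff)
        if candidates:
            # Do not overwrite a value that is already present under that name.
            fixed.setdefault(candidates[0], value)
        # No candidate: the attribute is silently dropped.
    element.attrib.clear()
    element.attrib.update(fixed)
    return element


element = fix_img_tag('<img src="http://... .jpg" al="myImage" hhh="aaa" />')
```

Calling `ET.tostring(element, encoding="unicode")` afterwards turns the corrected element back into markup. This only works when the tag is well-formed XML, which is the same restriction `XElement.Parse` has.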
edit: I wrote and tested this simplified best-match algorithm (sorry, it's VB):
Dim validTags() As String =
{
    "width",
    "height",
    "img"
}
(Simplified; you should create a more structured dictionary that maps each tag to its possible attributes.)
' source holds the (possibly misspelt) name taken from the parsed tag.
Dim maxMatch As Integer = 0
Dim matchedTag As String = Nothing
For Each Tag As String In validTags
    Dim match As Integer = checkMatch(Tag, source)
    ' Keep the candidate with the highest score seen so far.
    If match > maxMatch Then
        maxMatch = match
        matchedTag = Tag
    End If
Next
Debug.WriteLine("matched tag {0} matched % {1}", matchedTag, maxMatch)
The above code calls a method that computes, as a percentage, how closely the source string matches each valid tag.
Private Function checkMatch(ByVal tag As String, ByVal source As String) As Integer
    If tag = source Then Return 100
    ' Nothing to compare; this also avoids indexing into an empty string below.
    If source.Length = 0 Then Return 0
    Dim maxPercentage As Integer = 0
    ' Try aligning the start of source with every position in tag.
    For index As Integer = 0 To tag.Length - 1
        Dim tIndex As Integer = index
        Dim sIndex As Integer = 0
        Dim matchCounter As Integer = 0
        ' Count the characters that agree at this alignment.
        While True
            If tag(tIndex) = source(sIndex) Then
                matchCounter += 1
            End If
            tIndex += 1
            sIndex += 1
            If tIndex + 1 > tag.Length OrElse sIndex + 1 > source.Length Then
                Exit While
            End If
        End While
        ' Score against the longer string, so extra characters lower the match.
        Dim percentage As Integer = CInt(matchCounter * 100 / Math.Max(tag.Length, source.Length))
        If percentage > maxPercentage Then maxPercentage = percentage
    Next
    Return maxPercentage
End Function
Given a source string and a tag, the above method compares them character by character and returns the best match percentage.
Given "widt" as input, it finds "width" as the best match, with an 80% match value.
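To show how this scoring could be applied to the attribute list from the question, here is a hedged Python port of the same character-by-character comparison. The `best_match` helper and its `threshold` of 50 are assumptions added for this example; the VB code above does not define a cut-off, so without one every attribute would be mapped to something.

```python
# The valid attribute list from the question.
VALID_ATTRIBUTES = ["alt", "src", "width", "height", "align",
                    "border", "hspace", "longdesc", "vspace"]


def check_match(tag, source):
    """Return how closely source matches tag, as a percentage (0..100)."""
    if tag == source:
        return 100
    if not tag or not source:
        return 0
    best = 0
    # Align the start of source with every position in tag.
    for start in range(len(tag)):
        # zip stops at the shorter sequence, like the While loop in the VB code.
        matches = sum(1 for t, s in zip(tag[start:], source) if t == s)
        # Score against the longer string so extra characters lower the match.
        percentage = round(matches * 100 / max(len(tag), len(source)))
        best = max(best, percentage)
    return best


def best_match(source, valid=VALID_ATTRIBUTES, threshold=50):
    """Return the best-scoring valid name, or None if nothing reaches threshold.

    threshold is a hypothetical cut-off chosen for this sketch.
    """
    best_tag, best_score = None, 0
    for tag in valid:
        score = check_match(tag, source)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag if best_score >= threshold else None


used = ["src", "al", "hhh"]
# Map each used attribute to its best match and drop the ones with no match.
corrected = [m for m in (best_match(name) for name in used) if m is not None]
```

With the question's data, 'al' scores highest against 'alt', while 'hhh' stays well below the threshold against every valid name and is dropped.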
Upvotes: 1
Reputation: 2965
Parsing the tag is the hardest part. Since you've already done that, all you have to do now is loop through the attributes and check each one against an array of valid ones. If an attribute isn't valid, check it against a dictionary of commonly misspelt items, then replace or delete it as necessary.
Something similar to:
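// Needs: using System.Collections.Generic; using System.Linq;
// Assumes ImgTagAttributes is a List<String> holding the parsed attribute names.
String[] ValidItems = {"alt", "src", "width", "height", "align", "border", "hspace", "longdesc", "vspace"};
Dictionary<String, String> MispeltItems = new Dictionary<String, String> { {"al", "alt"} };
// Iterate backwards so RemoveAt does not shift the items still to be visited.
for (int i = ImgTagAttributes.Count - 1; i >= 0; i--)
{
    var element = ImgTagAttributes[i];
    if (!ValidItems.Contains(element))
    {
        if (MispeltItems.ContainsKey(element))
        {
            // Known misspelling: replace it with the correct name.
            ImgTagAttributes[i] = MispeltItems[element];
        }
        else
        {
            // Unknown attribute: drop it.
            ImgTagAttributes.RemoveAt(i);
        }
    }
}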
(I wrote this directly in Stack Overflow, so if there are any errors just say; it's only meant to give you the basic idea.)
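For comparison, the same lookup-table idea can be sketched in a few lines of Python. The names `VALID_ITEMS`, `MISSPELT_ITEMS`, and `fix_attributes` are made up for this example, and the misspelling table contains only the single entry from the question.

```python
# The valid attribute list from the question.
VALID_ITEMS = {"alt", "src", "width", "height", "align",
               "border", "hspace", "longdesc", "vspace"}

# Known misspellings mapped to the intended attribute (hypothetical table).
MISSPELT_ITEMS = {"al": "alt"}


def fix_attributes(attributes):
    """Keep valid names, rewrite known misspellings, drop everything else."""
    fixed = []
    for name in attributes:
        if name in VALID_ITEMS:
            fixed.append(name)
        elif name in MISSPELT_ITEMS:
            fixed.append(MISSPELT_ITEMS[name])
        # Anything else is neither valid nor a known misspelling, so skip it.
    return fixed


result = fix_attributes(["src", "al", "hhh"])
```

Building a new list avoids the need to iterate backwards while removing items. The trade-off of this approach is that it only catches misspellings someone has listed in advance.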
Upvotes: 1