Jeff Norman

Reputation: 1044

Using C#, how can I manually validate an HTML tag?

I have for example this image tag:

<img src="http://... .jpg" al="myImage" hhh="aaa" />

and I maintain, for example, a list of all valid attributes for a generic image tag:

L1=(alt, src, width, height, align, border, hspace, longdesc, vspace)

I am parsing the img tag and I am getting the used attributes like this:

L2=(src, al, hhh)

How can I programmatically validate the image tag, so that the 'al' attribute becomes 'alt' ('alt' is a closer match than 'align', which contains many more characters) and the 'hhh' attribute is removed (because no valid attribute resembles it)?

The resulting tag should look like this:

<img src="http://... .jpg" alt="myImage" />

Thanks.

Jeff

Upvotes: 1

Views: 567

Answers (2)

vulkanino

Reputation: 9124

You could use LINQ to XML to parse the tag easily:

XElement doc = XElement.Parse(...)

Then correct the wrong attributes using a best-match algorithm against an in-memory dictionary of valid attributes.
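
As a rough sketch of the parsing step in C#: this assumes the tag is well-formed XML (self-closed, with quoted attribute values), which XElement.Parse requires. The TagParser class and GetAttributeNames method are illustrative names, not part of any library.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TagParser
{
    // Returns the attribute names used in a single well-formed tag,
    // in the order they appear in the tag.
    public static List<string> GetAttributeNames(string tag)
    {
        XElement element = XElement.Parse(tag);
        return element.Attributes()
                      .Select(a => a.Name.LocalName)
                      .ToList();
    }
}
```

For the tag in the question this yields the list (src, al, hhh). HTML that is not well-formed XML, such as an unclosed tag or an unquoted attribute value, makes XElement.Parse throw an XmlException, so real-world markup may need a more lenient parser.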

Edit: I wrote and tested this simplified best-match algorithm (sorry, it's VB):

Dim validTags() As String =
            {
                "width",
                "height",
                "img"
            }

(simplified, you should create a more structured dictionary with tags and possible attributes for each tag)

Dim maxMatch As Integer = 0
Dim matchedTag As String = Nothing
For Each Tag As String In validTags
    Dim match As Integer = checkMatch(Tag, source)
    If match > maxMatch Then
        maxMatch = match
        matchedTag = Tag
    End If
Next

Debug.WriteLine("matched tag {0} matched % {1}", matchedTag, maxMatch)

The above code calls a method that computes, as a percentage, how closely the source string matches each valid tag.

Private Function checkMatch(ByVal tag As String, ByVal source As String) As Integer

    If tag = source Then Return 100

    ' Nothing to compare; this also avoids indexing into an empty string below.
    If String.IsNullOrEmpty(tag) OrElse String.IsNullOrEmpty(source) Then Return 0

    Dim maxPercentage As Integer = 0

    ' Try every starting offset in the tag, aligned with the start of the source.
    For index As Integer = 0 To tag.Length - 1

        Dim tIndex As Integer = index
        Dim sIndex As Integer = 0
        Dim matchCounter As Integer = 0

        ' Count the positions where the characters agree.
        While tIndex < tag.Length AndAlso sIndex < source.Length
            If tag(tIndex) = source(sIndex) Then
                matchCounter += 1
            End If

            tIndex += 1
            sIndex += 1
        End While

        Dim percentage As Integer = CInt(matchCounter * 100 / Math.Max(tag.Length, source.Length))
        If percentage > maxPercentage Then maxPercentage = percentage
    Next

    Return maxPercentage

End Function

The above method, given a source string and a tag, finds the best match percentage by comparing individual characters.

Given "widt" as input, it finds "width" as the best match, with an 80% match value.
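
Since the question asks for C#, here is a hedged port of the same idea. The AttributeMatcher class, the BestMatch helper, and the threshold of 50 are assumptions added for illustration; the threshold is what lets an attribute such as 'hhh' be dropped instead of being mapped to a poor match.

```csharp
using System;
using System.Collections.Generic;

public static class AttributeMatcher
{
    // Port of checkMatch: slide the candidate against the start of the source
    // and return the best percentage of positions whose characters agree.
    public static int CheckMatch(string candidate, string source)
    {
        if (candidate == source) return 100;
        if (string.IsNullOrEmpty(candidate) || string.IsNullOrEmpty(source)) return 0;

        int maxPercentage = 0;
        int longest = Math.Max(candidate.Length, source.Length);

        for (int offset = 0; offset < candidate.Length; offset++)
        {
            int matches = 0;
            for (int c = offset, s = 0; c < candidate.Length && s < source.Length; c++, s++)
            {
                if (candidate[c] == source[s]) matches++;
            }

            // Math.Round rounds to the nearest integer (ties to even), like VB's CInt.
            int percentage = (int)Math.Round(matches * 100.0 / longest);
            if (percentage > maxPercentage) maxPercentage = percentage;
        }

        return maxPercentage;
    }

    // Returns the closest valid name, or null when nothing reaches the threshold.
    // The default threshold is an assumption; tune it for your data.
    public static string BestMatch(string source, IEnumerable<string> validNames, int threshold = 50)
    {
        string best = null;
        int bestScore = 0;

        foreach (string name in validNames)
        {
            int score = CheckMatch(name, source);
            if (score > bestScore)
            {
                bestScore = score;
                best = name;
            }
        }

        return bestScore >= threshold ? best : null;
    }
}
```

With the attribute list from the question, BestMatch("al", ...) returns "alt" (67% against 40% for "align"), and BestMatch("hhh", ...) returns null because its best score, 20% against "width", is below the threshold.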

Upvotes: 1

Blam

Reputation: 2965

Parsing the tag is the hardest part. Since you've already done that, all you have to do now is loop through the attributes and check each one against an array of valid names. If an attribute isn't valid, check it against a dictionary of commonly misspelt names and replace or delete it as necessary.

Something similar to:

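// Assumes ImgTagAttributes is a List<String> holding the parsed attribute names.
// Needs: using System.Collections.Generic; using System.Linq;
String[] ValidItems = {"alt", "src", "width", "height", "align", "border", "hspace", "longdesc", "vspace"};

Dictionary<String, String> MispeltItems = new Dictionary<String, String> { {"al", "alt" } };

// Iterate backwards so RemoveAt does not shift the items still to be visited.
for(int i = ImgTagAttributes.Count - 1; i >= 0; i--)
{
    var element = ImgTagAttributes[i];
    if(!ValidItems.Contains(element))
    {
        if(MispeltItems.ContainsKey(element))
        {
            ImgTagAttributes[i] = MispeltItems[element];
            //Or use RemoveAt and Insert.
        }
        else
        {
            ImgTagAttributes.RemoveAt(i);
        }
    }
}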

(I wrote this directly in Stack Overflow, so if there are any errors just say; it's only meant to give you a basic idea.)
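
To write the corrections back into the tag itself, the same lookup can be applied to a parsed element. The sketch below assumes the tag is well-formed enough for XElement.Parse; TagCleaner and Clean are hypothetical names introduced only for this example.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class TagCleaner
{
    // Rebuilds the tag: valid attributes are kept, known misspellings are
    // renamed, and everything else is dropped.
    public static string Clean(string tag, ISet<string> validNames,
                               IDictionary<string, string> misspelt)
    {
        XElement element = XElement.Parse(tag);
        var kept = new List<XAttribute>();
        var seen = new HashSet<string>();

        foreach (XAttribute attribute in element.Attributes())
        {
            string name = attribute.Name.LocalName;
            string target = null;

            if (validNames.Contains(name))
            {
                target = name;
            }
            else if (misspelt.TryGetValue(name, out string corrected))
            {
                target = corrected;
            }

            // Skip unknown names, and skip a name that has already been added
            // (an element cannot carry two attributes with the same name).
            if (target != null && seen.Add(target))
            {
                // XAttribute names are read-only, so build a new attribute.
                kept.Add(new XAttribute(target, attribute.Value));
            }
        }

        element.ReplaceAttributes(kept);
        return element.ToString();
    }
}
```

Called with the question's tag, the valid list above, and the {"al", "alt"} dictionary, this leaves only the src and alt attributes. If a tag contains both a misspelling and its correct form, the one that appears first wins; whether that is the right policy depends on your data.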

Upvotes: 1
