Ivan

Reputation: 1

Creating a method in C# that traverses an HTML document and extracts content based on a query (a custom HTML crawler)

I created a program that parses every element of an HTML document and stores the result in a tree structure (an unbalanced tree).

public class Attribute
{
    public string? Key { get; set; }
    public string? Value { get; set; }
    public Attribute() {}
    public Attribute(string key, string value)
    {
        this.Key = key;
        this.Value = value;
    }
}
public class Node 
{
    public string Tag { get; set; }
    public string Content { get; set; }
    public List<Attribute> Attributes { get; set; }
    public List<Node> Children { get; set; }
    public Node(string tag)
    {
        Tag = tag;
        Attributes = new List<Attribute>();
        Children = new List<Node>();
        Content = "";
    }
}

This is the tree node structure.


The assignment has some constraints: Dictionaries, Queues, hash tables, LinkedList and similar collections are prohibited, with the exception of List, and so are HtmlAgilityPack and any other library. They may be used only if we implement them ourselves. That is why I use List and a separate Attribute class.


Here is the example HTML document on which I should perform queries:

<html>
<body>
    <p>Text1</p>
    <p>Text2</p>
    <p id='p3'>Text3</p>
    <div>
        <div>Text4</div>
        <p>Text5</p>
    </div>
    <table>
        <tr>
            <td>11</td>
        </tr>
        <tr>
            <td>22</td>
        </tr>
    </table>
    <table id='table2'>
        <tr>
            <td>33</td>
        </tr>
        <tr>
            <td>44</td>
        </tr>
    </table>
    <a href="http://https://www.w3schools.com" src="img_girl.bmp">w3schools</a>
    <img src="img_girl.bmp" />
</body>
</html>
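
To make the mapping concrete, here is a minimal sketch of how a fragment of the document above might look as Node objects. It assumes that Content holds only the element's own text and that nested elements go into Children; the real parser may store things differently. The names TreeDemo, Leaf, BuildFragment and Print are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;

public class Attribute
{
    public string? Key { get; set; }
    public string? Value { get; set; }
    public Attribute(string key, string value) { Key = key; Value = value; }
}

public class Node
{
    public string Tag { get; set; }
    public string Content { get; set; }
    public List<Attribute> Attributes { get; set; }
    public List<Node> Children { get; set; }
    public Node(string tag)
    {
        Tag = tag;
        Content = "";
        Attributes = new List<Attribute>();
        Children = new List<Node>();
    }
}

public static class TreeDemo
{
    // Hypothetical helper: creates a node that carries text content.
    public static Node Leaf(string tag, string content)
    {
        Node node = new Node(tag);
        node.Content = content;
        return node;
    }

    // Builds, by hand, a <body> holding <p id='p3'>Text3</p> and the nested <div>.
    public static Node BuildFragment()
    {
        Node body = new Node("body");

        Node p3 = Leaf("p", "Text3");
        p3.Attributes.Add(new Attribute("id", "p3"));
        body.Children.Add(p3);

        Node outerDiv = new Node("div");
        outerDiv.Children.Add(Leaf("div", "Text4"));
        outerDiv.Children.Add(Leaf("p", "Text5"));
        body.Children.Add(outerDiv);

        return body;
    }

    // Prints one line per node, indented by depth in the tree.
    public static void Print(Node node, int depth)
    {
        Console.WriteLine(new string(' ', depth * 2) + "<" + node.Tag + "> " + node.Content);
        foreach (Node child in node.Children)
        {
            Print(child, depth + 1);
        }
    }

    public static void Main()
    {
        Print(BuildFragment(), 0);
    }
}
```

The tree is unbalanced because the p node is a leaf at depth 1 while the div subtree goes one level deeper.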

The program should extract content from the HTML document based on queries, all of which the user enters in the console.

Basically, the program should be a simpler version of XPath.
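
For illustration, here is a minimal sketch of how a query such as //html/body/p[1] could be turned into the segments array that the traversal receives. The actual splitting code is not shown in this post, so QuerySplitter and SplitQuery are hypothetical names, and splitting on '/' is an assumption.

```csharp
using System;

public static class QuerySplitter
{
    // Hypothetical helper: splits a query on '/' and drops the empty
    // entries produced by the leading "//".
    // Limitation: a '/' inside an attribute value (for example a URL)
    // would also be treated as a separator.
    public static string[] SplitQuery(string query)
    {
        return query.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Prints html, body and p[1], one per line.
        foreach (string segment in SplitQuery("//html/body/p[1]"))
        {
            Console.WriteLine(segment);
        }
    }
}
```

Note that under this assumption the last segment still carries its predicate (p[1]), so it is not a plain tag name.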

This is my traversal function:

private List<string> Traverse(Node node, string[] segments, int index)
{

    List<string> matches = new List<string>();
    if (node == null) return matches;

    if (index + 1 >= segments.Length)
    {
        if (!string.IsNullOrEmpty(node.Content))
        {
            matches.Add(node.Content);
        }
        foreach (Node child in node.Children)
        {
            matches.AddRange(Traverse(child, segments, index));
        }
        return matches;
    }
    string segment = segments[index];

    if (segment.Contains("*"))
    {
        foreach (var child in node.Children)
        {
            matches.AddRange(Traverse(child, segments, index + 1));
        }
    }
    else if (segment.Contains("@"))
    {
        int attributeStart = segment.IndexOf('@');
        int attributeEnd = segment.IndexOf('=');
        if (attributeStart != -1 && attributeEnd != -1)
        {
            string attributeName = segment.Substring(attributeStart + 1, attributeEnd - attributeStart - 1);
            string attributeValue = segment.Substring(attributeEnd + 2, segment.Length - attributeEnd - 4);

            foreach (var child in node.Children)
            {
                if (child.Attributes.Any(attr => attr.Key == attributeName && attr.Value == attributeValue))
                {
                    matches.AddRange(Traverse(child, segments, index + 1));
                }
            }
        }
        else
        {
            Console.WriteLine("Invalid attribute format");
        }
    }
    else if (segment.Contains("["))
    {
        int indexStart = segment.IndexOf("[");
        int indexEnd = segment.IndexOf("]");
        if (indexStart != -1 && indexEnd != -1)
        {
            string indexValue = segment.Substring(indexStart + 1, indexEnd - indexStart - 1);
            if (int.TryParse(indexValue, out int indexNum))
            {
                if (indexNum >= 0 && indexNum < node.Children.Count)
                {
                    matches.AddRange(Traverse(node.Children[indexNum], segments, index + 1));
                }
                else
                {
                    Console.WriteLine($"Invalid index: {indexNum}");
                }
            }
            else
            {
                Console.WriteLine($"Invalid index value: {indexValue}");
            }
        }
        else
        {
            Console.WriteLine("Invalid index format");
        }
    }
    else
    {
        foreach (var child in node.Children)
        {

            if (segments[index + 1] == child.Tag)
            {
                matches.AddRange(Traverse(child, segments, index + 1));
            }
        }
    }
    return matches;
}

In the last else branch, the comparison used to be against segment; I changed it to segments[index + 1]:

//...
foreach (var child in node.Children)
{
    if (segments[index + 1] == child.Tag)
    {
        matches.AddRange(Traverse(child, segments, index + 1));
    }
}
//...

This change fixed the case where I enter //html/body and get no output, whereas it should output all the text content of the body.

But doing it this way does not allow me to search for specific elements such as //html/body/p[1].

When the function reaches the last else branch and iterates over the children of the node (node.Children), segments[index + 1] (for example p[1]) does not actually equal child.Tag (p), so the next level does not get traversed. The foreach loop then finishes and return matches is executed with an empty list.
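
The mismatch can be reproduced in isolation. The sketch below is one possible direction, not a drop-in fix: it separates the tag name from the bracketed predicate before comparing. Node is a trimmed-down copy of the class above, TagNameOf and SelectChildren are hypothetical helpers, and the numeric predicate is assumed to be 1-based and to count only children with the matching tag, as in XPath (the Traverse function above instead indexes all children from 0). Attribute predicates are left out.

```csharp
using System;
using System.Collections.Generic;

// Trimmed-down copy of Node, just enough for this sketch.
public class Node
{
    public string Tag;
    public string Content = "";
    public List<Node> Children = new List<Node>();
    public Node(string tag) { Tag = tag; }
}

public static class SegmentMatching
{
    // Hypothetical helper: the part of a segment before any '[' predicate,
    // e.g. "p[1]" -> "p", "body" -> "body".
    public static string TagNameOf(string segment)
    {
        int bracket = segment.IndexOf('[');
        return bracket == -1 ? segment : segment.Substring(0, bracket);
    }

    // Hypothetical helper: the children of parent selected by one segment.
    // Assumption: a numeric predicate is 1-based and counts only children
    // whose tag matches. Non-numeric predicates select nothing here.
    public static List<Node> SelectChildren(Node parent, string segment)
    {
        string tag = TagNameOf(segment);
        List<Node> sameTag = new List<Node>();
        foreach (Node child in parent.Children)
        {
            if (tag == "*" || child.Tag == tag) sameTag.Add(child);
        }

        int open = segment.IndexOf('[');
        int close = segment.IndexOf(']');
        if (open == -1 || close < open) return sameTag; // no predicate

        string inside = segment.Substring(open + 1, close - open - 1);
        List<Node> selected = new List<Node>();
        if (int.TryParse(inside, out int position)
            && position >= 1 && position <= sameTag.Count)
        {
            selected.Add(sameTag[position - 1]);
        }
        return selected;
    }

    public static void Main()
    {
        Node body = new Node("body");
        foreach (string text in new[] { "Text1", "Text2", "Text3" })
        {
            Node p = new Node("p");
            p.Content = text;
            body.Children.Add(p);
        }
        body.Children.Add(new Node("div"));

        Console.WriteLine("p[1]" == body.Children[0].Tag);          // False
        Console.WriteLine(SelectChildren(body, "p[1]")[0].Content); // Text1
        Console.WriteLine(SelectChildren(body, "p").Count);         // 3
    }
}
```

With a helper along these lines, the traversal could test the current segment against each child and recurse only into the selected ones, instead of comparing the raw segment string with child.Tag.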

I would appreciate it if somebody could follow my logic and help me fix the problem.

Upvotes: 0

Views: 42
