Reputation: 53

Converting HTML to a Tree using Java

So I am trying to write a program that takes a file containing simple HTML and turns it into a tree showing the hierarchy of the tags. Ultimately, each leaf would contain a tag (e.g. p, h, ul) and its text. Much of this is pretty simple, and I am planning on using JTree to show the final output. What I am having difficulty with is going through the syntax and building an initial tree of the tags without losing the relations between them. What I am thinking is that the entire file would be read as one long string. The program would find a '<' whose next char is not a '/' and treat that as a new tag/leaf. The code would then move on and check the following chars for another '<', which would indicate a child tag. If a '/' is found right after the '<', the code would move on to the next leaf on the same level.
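A minimal sketch of that scanning idea, under loud assumptions: the input is well formed, and it contains no attributes, comments, or void/self-closing tags. The class name `TagTreeSketch` and the synthetic "document" root are hypothetical. The relations are kept by a stack whose top is always the tag that is currently open, so a closing tag simply returns to the parent level.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import javax.swing.tree.DefaultMutableTreeNode;

// Hypothetical helper: builds a tree of tag names from well-formed markup.
// Assumes no attributes, comments, or void/self-closing tags.
class TagTreeSketch {

    static DefaultMutableTreeNode parse(String html) {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("document");
        // The top of the stack is the tag that is currently open.
        Deque<DefaultMutableTreeNode> open = new ArrayDeque<>();
        open.push(root);

        int pos = 0;
        while ((pos = html.indexOf('<', pos)) >= 0) {
            int end = html.indexOf('>', pos);
            if (end < 0) {
                break; // unterminated tag: stop scanning
            }
            String tag = html.substring(pos + 1, end).trim();
            if (tag.startsWith("/")) {
                // Closing tag: go back to the parent level, never popping the root.
                if (open.size() > 1) {
                    open.pop();
                }
            } else if (!tag.isEmpty()) {
                // Opening tag: attach it to the current parent, then descend into it.
                DefaultMutableTreeNode node = new DefaultMutableTreeNode(tag);
                open.peek().add(node);
                open.push(node);
            }
            pos = end + 1;
        }
        return root;
    }
}
```

The returned node can be handed straight to `new JTree(...)`. Real-world HTML breaks every one of the assumptions above, which is why a proper parser is the safer choice.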

Hopefully you get what I am trying to do. Unfortunately, my attempt was less than successful: it only showed the child nodes of the root tag. For now I am only trying to get the tags to work in a tree; the text and so on I can figure out later. To test the code, I used a string "test" that holds some basic sample HTML. Each of the nodes is shown within the root when the JTree is created, but the child nodes in node2 never show up. I am so confused and cannot wrap my head around this. Also, is there a simpler or more efficient way of doing this?

EDIT: I modified the code to use JSoup and managed to get it to work. However, for some reason all but the first child tag of the head tag get moved under the body tag, so body now has 3 children instead of one and head has only one instead of three. Also, how would I modify the recursive getChildren() function to work for each child layer within the previous child? For example, to get the h3 tag within the title tag?

package weboqltree_converter;

import javax.swing.JFrame;
import javax.swing.JTree;
import javax.swing.SwingUtilities;
import javax.swing.tree.DefaultMutableTreeNode;
import java.util.ArrayList;
import java.awt.Dimension;
import java.util.List;
import javax.swing.tree.TreeNode;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class GUI extends JFrame
{
    private JTree tree;
    private String test = "<html>"
            +   "<head>"
            +       "<title><h3>First parse<h3></title>"
            +       "<a></a>"
            +       "<h3></h3>"
            +   "</head>"
            +   "<body>"
            +       "<p>Parsed HTML into a doc.</p>"
            +   "</body>"
            + "</html>";

    private int parentNode;

    public static void main(String[] args)
    {
        SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                new GUI();
            }
        });
    }

    public GUI()
    {
        DefaultMutableTreeNode html = new DefaultMutableTreeNode("html");
        Document doc = Jsoup.parse(test);
        int children = doc.childNodes().get(0).childNodes().size();
        for(int i=0; i < children; i++){
            String tag = doc.childNodes().get(0).childNodes().get(i).nodeName();
            String text = "N/A"; //doc.childNodes().get(0).childNodes().get(i).toString();

            html.add(new DefaultMutableTreeNode("Tag: " + tag+ ", Text: " + text));

            System.out.println(tag+" : "+doc.childNodes().get(0).childNodes().get(i).childNodeSize());

            if(doc.childNodes().get(0).childNodes().get(i).childNodeSize() > 0){
                getChildren(html.getLastLeaf(), doc.childNodes().get(0).childNodes().get(i),0, doc.childNodes().get(0).childNodes().get(i).childNodeSize());
            }
        }
        System.out.println("tag: " + children);           


        //System.out.println(Tree.get(2) +" "+Tree.get(2).getChildCount());
        tree = new JTree(html);
        add(tree);

        this.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        this.setTitle("JTree Example"); 
        this.setMinimumSize(new Dimension(300, 400));
        this.setExtendedState(3);
        this.pack();
        this.setVisible(true);
    }

    public void getChildren(DefaultMutableTreeNode tree, Node doc, int start, int size){

        tree.add(new DefaultMutableTreeNode("Tag: " + doc.childNodes().get(start).nodeName()));
        start++;

        if(start < size){
            getChildren(tree, doc, start, size);
        }

    }
}

Upvotes: 3

Answers (2)

Stefan

Reputation: 12453

You can use JSoup to do that. It reads a String, a file or a URL and parses it into a Document object (which it does very fast). After that you can navigate the object and create a JTree from it.

String html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</p></body></html>";
Document document = Jsoup.parse(html);

Update

I have changed your code to use a recursive method. Because there might be more than one root node in the document (usually the "document" node and the "html" tag), it's a good idea to add a default root node. Have a look:

// additional imports needed: javax.swing.JScrollPane, javax.swing.tree.DefaultTreeModel
public GUI() {
    // create window
    this.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
    this.setTitle("JTree Example");
    this.setMinimumSize(new Dimension(300, 400));
    this.setExtendedState(3);

    // create tree and root node
    this.tree = new JTree();
    final DefaultMutableTreeNode ROOT = new DefaultMutableTreeNode("Html Document");

    // create model
    DefaultTreeModel treeModel = new DefaultTreeModel(ROOT);
    tree.setModel(treeModel);

    // add scrolling tree to window
    this.add(new JScrollPane(tree));

    // parse document (can be cleaned too)
    Document doc = Jsoup.parse(test);
    // Cleaner cleaner = new Cleaner(Whitelist.simpleText());
    // doc = cleaner.clean(doc);

    // walk the document tree recursively
    traverseRecursively(doc.getAllElements().first(), ROOT);

    this.expandAllNodes(tree);
    this.pack();
    this.setLocationRelativeTo(null);
    this.setVisible(true);
}

private void traverseRecursively(Node docNode, DefaultMutableTreeNode treeNode) {
    // iterate child nodes:
    for (Node nextChildDocNode : docNode.childNodes()) {
        // create tree node for this child:
        DefaultMutableTreeNode nextChildTreeNode = new DefaultMutableTreeNode(nextChildDocNode.nodeName());
        // add child to tree:
        treeNode.add(nextChildTreeNode);
        // do the same for this child's child nodes:
        traverseRecursively(nextChildDocNode, nextChildTreeNode);
    }
}

// can be removed ...
private void expandAllNodes(JTree tree) {
    int j = tree.getRowCount();
    int i = 0;
    while (i < j) {
        tree.expandRow(i);
        i += 1;
        j = tree.getRowCount();
    }
}

Upvotes: 5

GhostCat

Reputation: 140417

Sorry, but this is wrong on many levels.

First of all, parsing HTML/XML isn't easy, and your current code to get there is way too naive. Instead of doing something like this yourself, you had better use an existing library to do the parsing for you. Getting that right will be hard enough already. (The chances that you write a complete parser in a correct and robust way are close to zero, though.)

Then: instead of focusing on "complex" tasks ... I would rather suggest that you focus on some craftsmanship aspects of programming first. For example: your code up there is pretty much untestable (as it is doing everything within that poor constructor method). It is also much harder to read than it ought to be.
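As a rough illustration of what "testable" could mean here, the tree building can live in a plain static method that never touches a window. This is only a sketch under stated assumptions: the class name `HtmlTreeBuilder` is hypothetical, the input is assumed to be well-formed XHTML so that the JDK's own XML parser can stand in for an HTML library, and text nodes are skipped.

```java
import java.io.StringReader;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class: builds the tree model without creating any Swing window.
// Assumes well-formed XHTML; the JDK XML parser is a stand-in for an HTML parser.
class HtmlTreeBuilder {

    // Parses the markup and returns the root of a JTree-ready node tree.
    static DefaultMutableTreeNode build(String xhtml) {
        try {
            org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return convert(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed", e);
        }
    }

    // Recursively mirrors element nodes; text nodes are ignored in this sketch.
    static DefaultMutableTreeNode convert(Node source) {
        DefaultMutableTreeNode target = new DefaultMutableTreeNode(source.getNodeName());
        NodeList children = source.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                target.add(convert(child));
            }
        }
        return target;
    }
}
```

A unit test can then assert on the returned node (child counts, tag names) without opening a frame, and the constructor shrinks to wiring the result into a JTree.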

My (personal) recommendation:

Learn about writing testable code (see here)
Learn about using TDD and doing unit tests with JUnit
Learn about "Clean code" by reading that book by Robert Martin

In other words: it seems that you want to spend your efforts on solving complex problems. But in order to do that in an efficient, enduring way ... you are lacking very basic skills. It doesn't do much good when you write code that solves some problem ... when that code is of bad quality! I know that doesn't sound like much "fun", but believe me: doing TDD "the right way" is an extremely rewarding activity!

Upvotes: 1
