Modifying HTML in Memory with JSoup

Question

I was recently advised to use JSoup to parse and modify HTML documents.

But suppose I have an HTML document that I want to modify (to send it, store it somewhere else, etc.). How might I go about doing that without changing the original document?

Say I have an HTML file like so:


 
      
  
  Title: title
  
  Name: 
  Address: 
  Phone Number:

I want to fill in the appropriate data for Name, Address, Phone Number and any other information I'd like, without modifying the original HTML file. How might I go about that using JSoup?

signus · Accepted Answer

@MarcoS posted an excellent solution at https://stackoverflow.com/a/6594828/1861357 that uses a NodeTraversor to build a list of nodes to change. I modified his method only slightly: it replaces a node with the data already in that node plus whatever information you would like to add.

To keep the HTML in memory I used static StringBuilder fields: one for the file as it was read and one for the modified result.

First we read in the HTML file (the path is hard-coded here, but that is easy to change), then we run a series of checks and add the data we want to whichever nodes match.
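As a rough illustration of the read-then-modify idea, the sketch below loads a file into a String and edits only that in-memory copy. The class name InMemoryCopy, the helper fillName and the temporary file are made up for this example, and it uses only the standard library rather than JSoup.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class InMemoryCopy {
    // Reads the whole file into a String. The file is only opened for reading.
    static String load(Path htmlFile) {
        try {
            return new String(Files.readAllBytes(htmlFile), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns a new String with the name filled in. Strings are immutable,
    // so the text passed in is left as it was.
    static String fillName(String html, String name) {
        return html.replace("Name: ", "Name: " + name);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical template file, created here so the example is self-contained.
        Path template = Files.createTempFile("template", ".html");
        Files.write(template, "<p>Name: </p>".getBytes(StandardCharsets.UTF_8));

        String original = load(template);
        String modified = fillName(original, "John Doe");

        System.out.println(modified);
        // Re-reading the file shows it still holds the unmodified text.
        System.out.println(load(template).equals(original));

        Files.delete(template);
    }
}
```

Nothing is ever written back to the file, so the original on disk stays the same no matter how the in-memory copy is changed.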

The one problem in MarcoS's solution that I didn't fix is that it splits the text into individual words instead of looking at a whole line. As a workaround I joined multi-word labels with '-' (for example "Phone-Number"), because otherwise the string is placed directly after whichever single word matched.

So a full implementation:

import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.*;

public class memoryHTML
{
    static String htmlLocation = "C:\\Users\\User\\";            // Backslashes must be escaped in Java string literals.
    static String fileName = "blah";                            // Just for demonstration, easily modified.
    static StringBuilder buildTmpHTML = new StringBuilder();
    static StringBuilder buildHTML = new StringBuilder();
    static String name = "John Doe";
    static String address = "42 University Dr., Somewhere, Someplace";
    static String phoneNumber = "(123) 456-7890";

    public static void main(String[] args)
    {
        // You can send it the full path with the filename. I split them up because I used this for multiple files.
        readHTML(htmlLocation, fileName);
        modifyHTML();

        System.out.println(buildHTML.toString());

        // You need to clear the StringBuilder Object or it will remain in memory and build on each run.
        buildTmpHTML.setLength(0);
        buildHTML.setLength(0);

        System.exit(0);
    }

    // Read the HTML file into buildTmpHTML; modifyHTML() then works on this in-memory copy.
    public static void readHTML(String directory, String fileName)
    {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader(directory + fileName + ".html")))
        {
            String line;
            while((line = br.readLine()) != null)
            {
                // Keep the line breaks so text on adjacent lines does not run together
                buildTmpHTML.append(line).append('\n');
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
            System.exit(1);
        }
    }

    // Excellent method of parsing and modifying nodes in HTML files by @MarcoS at https://stackoverflow.com/a/6594828/1861357
    // It has its small problems, but it does the trick.
    public static void modifyHTML()
    {
        String htmld = buildTmpHTML.toString();
        Document doc = Jsoup.parse(htmld);

        final List<TextNode> nodesToChange = new ArrayList<TextNode>();

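        // Written against an older jsoup release; newer releases use the static NodeTraversor.traverse(visitor, root) instead of this constructor.
        NodeTraversor nd = new NodeTraversor(new NodeVisitor()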
        {
          @Override
          public void tail(Node node, int depth) 
          {
            if (node instanceof TextNode) 
            {
              TextNode textNode = (TextNode) node;
              nodesToChange.add(textNode);
            }
          }

          @Override
          public void head(Node node, int depth) 
          {        
          }
        });

        nd.traverse(doc.body());

        for (TextNode textNode : nodesToChange) 
        {
          Node newNode = buildElementForText(textNode);
          textNode.replaceWith(newNode);
        }

        buildHTML.append(doc.html());
    }

    private static Node buildElementForText(TextNode textNode)
    {
        String text = textNode.getWholeText();
        String[] words = text.trim().split(" ");
        // Collect the distinct words so each one is processed only once
        Set<String> units = new HashSet<String>();
        for (String word : words)
            units.add(word);

        String newText = text;
        for (String rpl : units)
        {
            // replace() matches literally, so characters such as '(' or '$' are not read as regex syntax
            if(rpl.contains("Name"))
                newText = newText.replace(rpl, rpl + " " + name);
            if(rpl.contains("Address") || rpl.contains("Residence"))
                newText = newText.replace(rpl, rpl + " " + address);
            if(rpl.contains("Phone-Number") || rpl.contains("PhoneNumber"))
                newText = newText.replace(rpl, rpl + " " + phoneNumber);
        }
        return new DataNode(newText, textNode.baseUri());
    }
}

And you'll get this HTML back (remember I changed "Phone Number" to "Phone-Number"):


 
      
  
  Title: title
  
  Name: John Doe 
  Address: 42 University Dr., Somewhere, Someplace
  Phone-Number: (123) 456-7890
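
The hyphen workaround could probably be avoided by matching each label as a whole phrase instead of splitting the text on spaces. The following is only a sketch under that assumption: LabelFiller and fill are hypothetical names, the labels are assumed to end with a colon, and it is plain Java that has not been wired into the JSoup code above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LabelFiller {
    // Appends each value after its "Label:" marker. The label is matched as one
    // literal phrase, so a label containing spaces needs no hyphen.
    static String fill(String text, Map<String, String> values) {
        String result = text;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            String label = entry.getKey() + ":";
            // String.replace treats both arguments literally (no regex).
            result = result.replace(label, label + " " + entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("Name", "John Doe");
        values.put("Phone Number", "(123) 456-7890");

        System.out.println(fill("Name:", values));
        System.out.println(fill("Phone Number:", values));
    }
}
```

Passing textNode.getWholeText() to a method like this from buildElementForText would be one way to use it. Note that a short label such as "Name" also matches inside a longer one such as "Last Name:", so the labels need to be chosen with that in mind.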
