Luiserebii

Reputation: 55

JSoup seems to ignore character codes?

I'm building a small CMS-like application in Java that reads a .txt file of shirt names and descriptions and loads them into an ArrayList of customShirts (a small class I made). It then iterates through the ArrayList, uses JSoup to parse a template (template.html), and inserts each shirt's details into the HTML. Finally, it writes each shirt to its own HTML file in an output folder.

When the descriptions are loaded into the ArrayList of customShirts, I replace all special characters with the appropriate character codes so they display properly (for example, replacing apostrophes with the character code for ’). The problem is that JSoup seems to automatically turn the character codes back into the actual characters, and I need the output to keep the character codes in order to display correctly. Is there anything I can do to fix this? I've looked at other workarounds, such as the one in "Jsoup unescapes special characters", but they seem to require running replaceAll on the file before insertion. Since I insert the character-code-sensitive text with JSoup itself, that doesn't seem to be an option.

Below is the code for the HTML generator I made:

// Uses java.io.File, java.io.IOException, java.io.PrintWriter,
// org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.nodes.Element
public void generateShirtHTML(){

    for(int i = 0; i < arrShirts.size(); i++){

        // Re-parse the template for every shirt so each page starts from a clean copy
        File input = new File("html/template/template.html");
        Document doc;
        try {
            doc = Jsoup.parse(input, "UTF-8", "");
        } catch (IOException e) {
            // Without the template nothing can be generated, so stop here
            // instead of hitting a NullPointerException on doc below
            e.printStackTrace();
            return;
        }

        Element title = doc.select("title").first();
        title.append(arrShirts.get(i).nameToCapitalized());

        Element headingTitle = doc.select("h1#headingTitle").first();
        headingTitle.html(arrShirts.get(i).nameToCapitalized());

        Element shirtDisplay = doc.select("p#alt1").first();
        shirtDisplay.html(arrShirts.get(i).name);

        Element descriptionBox = doc.select("div#descriptionbox p").first();
        descriptionBox.html(arrShirts.get(i).desc);
        System.out.println(arrShirts.get(i).desc);

        // try-with-resources closes the writer even if writing fails;
        // the file is written as UTF-8 to match how the template was read
        try (PrintWriter output = new PrintWriter("html/output/" + arrShirts.get(i).URL, "UTF-8")) {
            output.println(doc.outerHtml());
        } catch (IOException e) {
            e.printStackTrace();
        }

        System.out.println("Shirt " + i + " HTML generated!");

    }

}

Thanks in advance!

Upvotes: 2

Views: 1083

Answers (1)

Jonas Czech

Reputation: 12328

Expanding a little on my comment (since Stephan encouraged me..), you can use

doc.outputSettings().escapeMode(Entities.EscapeMode.extended);

This tells Jsoup to escape (encode) special characters in the output, e.g. a left double quote (“) as &ldquo;. To make Jsoup encode all special characters, you may also need to add

doc.outputSettings().charset("ASCII");

This ensures that every Unicode special character (anything outside ASCII) is HTML encoded in the output.
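To make the effect of the ASCII setting more concrete, here is a minimal plain-Java sketch with no Jsoup involved. It rewrites every character outside ASCII as a decimal numeric character reference, which is roughly the guarantee an ASCII-only output gives you. This is not how Jsoup does it internally, and with the extended escape mode Jsoup will use named entities such as &ldquo; where it can; the class NumericEscaper and the method escapeNonAscii are hypothetical names made up for this example.

```java
// Hypothetical helper for illustration only; not part of Jsoup.
final class NumericEscaper {

    private NumericEscaper() {}

    /**
     * Returns a copy of the text in which every code point above U+007F
     * is replaced by a decimal numeric character reference (&#NNNN;).
     * ASCII characters, including & and <, are left untouched.
     */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Read whole code points so surrogate pairs become one reference
            int cp = text.codePointAt(i);
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+2019 is the right single quote, U+201C / U+201D are curly double quotes
        System.out.println(escapeNonAscii("It\u2019s a \u201Cgreat\u201D shirt"));
    }
}
```

A helper like this could also be run over the string returned by doc.outerHtml() as a fallback, since it only touches non-ASCII characters and leaves the markup itself alone.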

For larger projects where you have to fill data into HTML files, you can look at using a template engine such as Thymeleaf. It's easier to use for this kind of job (less code and such), and it offers many more features aimed specifically at this process. For small projects like yours, Jsoup works well (I've used it this way in the past), but as a project grows, or even for a small one, it's worth looking into more specialized tools.

Upvotes: 3
