Cheok Yan Cheng

Reputation: 42760

Convert HTML Character Back to Text Using Java Standard Library

I would like to convert some HTML character entities back to text using the Java standard library. Is there any library that would achieve this?

import java.io.UnsupportedEncodingException;

public class Main {
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // "Happy & Sad" in HTML form.
        String s = "Happy &amp; Sad";
        System.out.println(s);

        try {
            // Should change to "Happy & Sad". DOESN'T WORK!
            s = java.net.URLDecoder.decode(s, "UTF-8");
            System.out.println(s);
        } catch (UnsupportedEncodingException ex) {
            // Not expected: UTF-8 is always supported.
        }
    }
}

Upvotes: 43

Views: 128264

Answers (8)

Bill.D

Reputation: 166

I think the StringEscapeUtils.unescapeHtml3() and unescapeHtml4() methods are what you are looking for. They originated in Apache Commons Lang 3 and now live in Apache Commons Text, which is what the link documents. See https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html.

Upvotes: 60

Or you can use unescapeHtml4:

    String miCadena = "GU&Iacute;A TELEF&Oacute;NICA";
    System.out.println(StringEscapeUtils.unescapeHtml4(miCadena));

This code prints the line: GUÍA TELEFÓNICA

Upvotes: 2

Bruno Barros

Reputation: 1

You can use the class org.apache.commons.lang.StringEscapeUtils (Commons Lang 2.x):

String s = StringEscapeUtils.unescapeHtml("Happy &amp; Sad");

It works: s is now "Happy & Sad".

Upvotes: 4

Daniele

Reputation: 831

As @jem suggested, it is possible to use jsoup.

With jsoup 1.8.3 it is possible to use the method Parser.unescapeEntities, which unescapes the entities while retaining the original HTML markup.

import org.jsoup.parser.Parser;
...
String html = Parser.unescapeEntities(original_html, false);

This method does not seem to be present in some earlier releases.

Upvotes: 1

jem

Reputation: 41

Just add the jsoup JAR file to your application's libraries and then use this code.

import org.jsoup.Jsoup;

public class Encoder {
    public static void main(String[] args) {
        // The input holds HTML entities; text() returns the decoded text.
        String s = Jsoup.parse("&lt;Fran&ccedil;ais&gt;").text();
        System.out.print(s);
    }
}

Link to download jsoup: http://jsoup.org/download

Upvotes: 29

Rich

Reputation: 335

URLDecoder should only be used for decoding strings from URLs generated by HTML forms, which use the "application/x-www-form-urlencoded" MIME type. It does not support HTML character entities.
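
To make the difference concrete, here is a minimal sketch using only the standard library. The class name UrlDecoderDemo and the helper decodeForm are made up for illustration. It shows that percent-encoded form data is decoded, while a string containing an HTML entity comes back unchanged.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UrlDecoderDemo {
    // Hypothetical helper: decodes application/x-www-form-urlencoded text as UTF-8.
    static String decodeForm(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // '+' becomes a space and %26 becomes '&'.
        System.out.println(decodeForm("Happy+%26+Sad"));   // prints "Happy & Sad"
        // No '%' or '+' here, so the HTML entity passes through untouched.
        System.out.println(decodeForm("Happy &amp; Sad")); // prints "Happy &amp; Sad"
    }
}
```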

After a search I found a Translate class within the HTML Parser library.

Upvotes: 5

Zach Scrivena

Reputation: 29559

java.net.URLDecoder deals only with the application/x-www-form-urlencoded MIME format (e.g. "%20" represents space), not with HTML character entities. I don't think there's anything on the Java platform for that. You could write your own utility class to do the conversion, like this one.
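
For a rough idea of what such a utility class could look like, here is a sketch that relies only on the standard library. It is not the class linked above: the name HtmlUnescaper, its method names, and the tiny entity table are all assumptions for illustration. It handles a handful of named entities plus decimal and hexadecimal numeric references, and copies anything it does not recognise through unchanged. HTML defines far more named entities than this table covers, so a library is the better choice for real input.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlUnescaper {
    // Deliberately tiny table (an assumption for this sketch, not a complete list).
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00A0");
    }

    // Matches &name;, &#123; and &#x7B; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);");

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = decode(m.group(1));
            // Unknown or invalid references are copied through as-is.
            String replacement = decoded != null ? decoded : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the reference is not recognised.
    private static String decode(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        try {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // number does not fit in an int
        }
    }

    public static void main(String[] args) {
        System.out.println(unescape("Happy &amp; Sad")); // prints "Happy & Sad"
        System.out.println(unescape("&lt;Fran&#231;ais&gt;"));
    }
}
```

Because the input is scanned in a single pass, a string such as "&amp;lt;" becomes "&lt;" rather than being decoded twice.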

Upvotes: 7

rogeriopvl

Reputation: 54104

I'm not aware of any way to do it using the standard library, but I do know and use this class, which deals with HTML entities.

"HTMLEntities is an Open Source Java class that contains a collection of static methods (htmlentities, unhtmlentities, ...) to convert special and extended characters into HTML entitities and vice versa."

http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=htmlentities

Upvotes: 2
