Reputation: 42760
I would like to convert some HTML character entities back to text using the Java standard library. Is there any library that can do this?
/**
 * @param args the command line arguments
 */
public static void main(String[] args) {
    // "Happy & Sad" in HTML form.
    String s = "Happy &amp; Sad";
    System.out.println(s);
    try {
        // Change to "Happy & Sad". DOESN'T WORK!
        s = java.net.URLDecoder.decode(s, "UTF-8");
        System.out.println(s);
    } catch (UnsupportedEncodingException ex) {
        // UTF-8 is always supported, so this branch is never reached.
    }
}
Upvotes: 43
Views: 128264
Reputation: 166
I think the StringEscapeUtils.unescapeHtml3() and unescapeHtml4() methods from Apache Commons Lang (the class now lives in Apache Commons Text) are what you are looking for. See https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html.
Upvotes: 60
Reputation: 1
Or you can use unescapeHtml4:
String miCadena = "GU&Iacute;A TELEF&Oacute;NICA";
System.out.println(StringEscapeUtils.unescapeHtml4(miCadena));
This code prints the line: GUÍA TELEFÓNICA
Upvotes: 2
Reputation: 1
You can use the class org.apache.commons.lang.StringEscapeUtils:
String s = StringEscapeUtils.unescapeHtml("Happy &amp; Sad");
This works.
Upvotes: 4
Reputation: 831
As @jem suggested, it is possible to use jsoup.
With jsoup 1.8.3 it is possible to use the method Parser.unescapeEntities, which unescapes the entities and otherwise retains the original HTML.
import org.jsoup.parser.Parser;
...
String html = Parser.unescapeEntities(original_html, false);
This method does not appear to be present in some earlier releases.
Upvotes: 1
Reputation: 41
Just add the jsoup jar file to your application's lib directory and then use this code.
import org.jsoup.Jsoup;
public class Encoder {
public static void main(String args[]) {
String s = Jsoup.parse("&lt;Fran&ccedil;ais&gt;").text();
System.out.print(s);
}
}
Link to download jsoup: http://jsoup.org/download
Upvotes: 29
Reputation: 335
URLDecoder should only be used to decode strings from URLs generated by HTML forms, which use the "application/x-www-form-urlencoded" MIME type. It does not handle HTML character entities.
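To make the difference concrete, here is a minimal sketch using only the standard library. The class name UrlDecodeDemo and the helper formDecode are made up for illustration; the only real API used is java.net.URLDecoder.decode(String, String).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of any library.
class UrlDecodeDemo {
    // Decodes application/x-www-form-urlencoded text:
    // "+" becomes a space and "%XX" becomes the corresponding byte.
    static String formDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Form-encoded input is decoded.
        System.out.println(formDecode("Happy+%26+Sad"));   // prints: Happy & Sad
        // An HTML entity contains neither "+" nor "%", so it passes through unchanged.
        System.out.println(formDecode("Happy &amp; Sad")); // prints: Happy &amp; Sad
    }
}
```

The second call shows why the code in the question appears to do nothing: the decoder has no notion of entities such as &amp;.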
After a search I found a Translate class within the HTML Parser library.
Upvotes: 5
Reputation: 29559
java.net.URLDecoder deals only with the application/x-www-form-urlencoded MIME format (e.g. "%20" represents a space), not with HTML character entities. I don't think there's anything on the Java platform for that. You could write your own utility class to do the conversion, like this one.
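For a rough idea of what such a hand-written utility could look like, here is a minimal sketch that relies only on java.util and java.util.regex. The class name SimpleHtmlUnescaper and its tiny entity table are assumptions made for illustration; it handles a handful of named entities plus decimal and hexadecimal numeric references, and leaves anything it does not recognise untouched. A real implementation would need the full HTML entity list.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical utility class, written for illustration only.
class SimpleHtmlUnescaper {
    // Deliberately small table; extend it as needed.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00A0");
    }

    // Matches &#123; (decimal), &#x7B; (hex) and &name; references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = decode(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // unknown entity: keep it as written
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the reference is not recognised.
    private static String decode(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        try {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
    }

    public static void main(String[] args) {
        System.out.println(unescape("Happy &amp; Sad")); // prints: Happy & Sad
    }
}
```

Each reference is decoded exactly once, so input such as "&amp;lt;" becomes "&lt;" rather than "<".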
Upvotes: 7
Reputation: 54104
I'm not aware of any way to do it using the standard library. But I do know and use this class that deals with HTML entities.
"HTMLEntities is an Open Source Java class that contains a collection of static methods (htmlentities, unhtmlentities, ...) to convert special and extended characters into HTML entities and vice versa."
http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=htmlentities
Upvotes: 2