Crayl

Reputation: 1911

How to parse this site with Jsoup or another parser?

I'm trying to parse a page that declares no encoding in its HTTP headers; the HTML itself declares ISO-8859-1. Jsoup can't parse it with default settings (neither can HtmlUnit nor PHP's Simple HTML DOM Parser). Even when I pass the encoding to Jsoup explicitly, it still doesn't work, and I can't figure out why.

Here's my code:

    String url = "http://www.parkett.de";
    Document doc = null;
    try {
        doc = Jsoup.parse(new URL(url).openStream(), "ISO-8859-1", url);
        // doc = Jsoup.parse(new URL(url).openStream(), "CP1252", url);
    } catch (IOException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }

    Element extractHtml = null;
    Elements elements = null;
    String title = null;
    elements = doc.select("h1");
    if(!elements.isEmpty()) {
        extractHtml = elements.get(0);
        title = extractHtml.text();
    }
    System.out.println(title);

Thanks for any suggestions!

Upvotes: 1

Views: 298

Answers (1)

Richard Krajunus

Reputation: 809

When loading a document from a URL, chapters 4 and 9 of the Jsoup cookbook recommend Jsoup.connect(...).get(). Chapter 5 suggests Jsoup.parse() for loading a document from a local file.

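    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class FirstHeading { // class name chosen for illustration

        public static void main(String[] args) {
            Document doc;
            try {
                // connect() fetches the page over HTTP and parses the response
                doc = Jsoup.connect("http://www.parkett.de/").get();
            } catch (IOException e) {
                e.printStackTrace();
                return; // nothing to parse if the request failed
            }

            // first() returns null when no <h1> matches
            Element firstH1 = doc.select("h1").first();
            System.out.println((firstH1 != null) ? firstH1.text() : "First <h1> not found.");
        }
    }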
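
If the raw bytes are already in hand and the HTTP header names no charset, as the question describes, the encoding declared in the markup can be read before decoding. The following is only a rough sketch of that idea using the standard library; the class `CharsetSniffer`, its methods `sniff` and `decode`, and the sample page in `main` are hypothetical names made up for illustration and are not part of Jsoup. It assumes the declaration sits in a `<meta>` tag within the first 4096 bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer { // hypothetical helper, not a Jsoup class

    // Matches charset=... inside a <meta> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the markup, or the fallback if none is usable. */
    static Charset sniff(byte[] html, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so this decode cannot fail
        // and leaves the ASCII-only meta tag readable.
        int length = Math.min(html.length, 4096);
        String head = new String(html, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    /** Decodes the bytes with the sniffed charset. */
    static String decode(byte[] html, Charset fallback) {
        return new String(html, sniff(html, fallback));
    }

    public static void main(String[] args) {
        // Made-up sample page; \u00f6 is the letter o with umlaut.
        String page = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=ISO-8859-1\"></head>"
                + "<body><h1>Parkettb\u00f6den</h1></body></html>";
        byte[] bytes = page.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(sniff(bytes, StandardCharsets.UTF_8)); // prints ISO-8859-1
        System.out.println(decode(bytes, StandardCharsets.UTF_8)
                .contains("Parkettb\u00f6den")); // prints true
    }
}
```

The decoded string could then be handed to Jsoup.parse(String html, String baseUri). Whether a wrong charset is actually what breaks parsing for this particular site is not established here.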

Upvotes: 1
