Kennedy Kan

Reputation: 383

How to combine HTTP header handling and content reading in a Java program?

I have a program that is supposed to fetch the content of an HTML page.

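import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class University {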
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        URL url = new URL ("http://www.4icu.org/reviews/index2.htm");

        URLConnection spoof = url.openConnection();        
        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

        String connect = url.toString();
        Document doc = Jsoup.connect(connect).get();

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}

However, something at the HTTP header level seems to be blocking access to the content. I therefore wrote the following program to print the response headers of the site.

import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class Get_Header {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://www.4icu.org/reviews/index2.htm");
    URLConnection connection = url.openConnection();

    // Header name -> list of values; the status line is stored under the null key
    Map<String, List<String>> responseMap = connection.getHeaderFields();
    for (Map.Entry<String, List<String>> entry : responseMap.entrySet()) {
      System.out.println(entry.getKey() + " = ");

      for (String value : entry.getValue()) {
        System.out.println(value + ", ");
      }
    }
  }
}

It returns the following result.

X-Frame-Options = 
SAMEORIGIN, 
Transfer-Encoding = 
chunked, 
null = 
HTTP/1.1 403 Forbidden, 
CF-RAY = 
2ca61c7a769b1980-HKG, 
Server = 
cloudflare-nginx, 
Cache-Control = 
max-age=10, 
Connection = 
keep-alive, 
Set-Cookie = 
__cfduid=d4f8d740e0ae0dd551be15e031359844d1469853403; expires=Sun, 30-Jul-17 04:36:43 GMT; path=/; domain=.4icu.org; HttpOnly, 
Expires = 
Sat, 30 Jul 2016 04:36:53 GMT, 
Date = 
Sat, 30 Jul 2016 04:36:43 GMT, 
Content-Type = 
text/html; charset=UTF-8, 
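
A note on reading this output: the "null =" entry is the HTTP status line, which URLConnection.getHeaderFields() stores under a null key, so the server answered this request with 403 Forbidden. Below is a minimal sketch of pulling the status out of such a map. The class name HeaderDump and its helper methods are made up for illustration, and the map is filled by hand with a few values copied from the output above rather than fetched from the network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeaderDump {
    // getHeaderFields() keeps the status line (e.g. "HTTP/1.1 403 Forbidden") under the null key.
    static String statusLine(Map<String, List<String>> headers) {
        List<String> values = headers.get(null);
        return (values == null || values.isEmpty()) ? null : values.get(0);
    }

    // Pulls the numeric code out of a status line: "HTTP/1.1 403 Forbidden" -> 403.
    static int statusCode(String statusLine) {
        return Integer.parseInt(statusLine.trim().split("\\s+")[1]);
    }

    // Formats the real headers as "Name: value1, value2", skipping the status line.
    static List<String> format(Map<String, List<String>> headers) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            if (entry.getKey() != null) {
                lines.add(entry.getKey() + ": " + String.join(", ", entry.getValue()));
            }
        }
        return lines;
    }

    // Hand-built sample copied from part of the output above (no network access).
    static Map<String, List<String>> sample() {
        Map<String, List<String>> headers = new LinkedHashMap<>();
        headers.put(null, Arrays.asList("HTTP/1.1 403 Forbidden"));
        headers.put("Server", Arrays.asList("cloudflare-nginx"));
        headers.put("Content-Type", Arrays.asList("text/html; charset=UTF-8"));
        return headers;
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = sample();
        System.out.println("Status: " + statusCode(statusLine(headers)));
        for (String line : format(headers)) {
            System.out.println(line);
        }
    }
}
```

With a live connection, the same helpers could be given the map returned by connection.getHeaderFields() instead of sample().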

I can get the headers, but how should I combine the two programs into a complete one?

Thanks in advance.

Upvotes: 1

Views: 891

Answers (2)

ebo

Reputation: 2747

The "User-Agent" property is set on the URLConnection (spoof), but that connection is never used. url.toString() returns only the address, so Jsoup.connect(connect) opens its own connection without your header.

Setting the user-agent on the JSoup connection seems to work:

public static void main(String[] args) throws Exception {
    System.out.println("Started");

    String url = "http://www.4icu.org/reviews/index2.htm";
    Document doc = Jsoup.connect(url).userAgent("Mozilla").get();

    Elements cells = doc.select("td.i");

    Iterator<Element> iterator = cells.iterator();

    while (iterator.hasNext()) {
        Element cell = iterator.next();
        String university = cell.select("a").text();
        String country = cell.nextElementSibling().select("img").attr("alt");

        System.out.printf("country : %s, university : %s %n", country, university);
    }
}
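
To illustrate with the JDK alone why the header from the question never reaches Jsoup: a request property lives on the URLConnection object it was set on, while the string form of the URL carries only the address. The sketch below is just an illustration under that reading. The class name UserAgentScope and the helper open are invented here, and the code never actually connects; it only inspects the objects.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

class UserAgentScope {
    static final String ADDRESS = "http://www.4icu.org/reviews/index2.htm";
    static final String AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)";

    // Creates a connection object for the address; nothing is sent until it is used.
    static URLConnection open(String address) {
        try {
            return URI.create(address).toURL().openConnection();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection spoof = open(ADDRESS);
        spoof.setRequestProperty("User-Agent", AGENT);

        // The property is stored on this one connection object...
        System.out.println("on spoof: " + spoof.getRequestProperty("User-Agent"));

        // ...but the URL text that was handed to Jsoup has no trace of it.
        String connect = spoof.getURL().toString();
        System.out.println("url text: " + connect);

        // Anything that opens a new connection from that text starts without the header.
        URLConnection fresh = open(connect);
        System.out.println("on fresh: " + fresh.getRequestProperty("User-Agent"));
    }
}
```

This is why the header has to be set on whichever object actually performs the request, which for Jsoup is the object returned by Jsoup.connect(...).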

Upvotes: 1

TDG

Reputation: 6171

You can use Jsoup's Connection.Response to fetch the page, print its headers, and then parse it into a Document to extract the text you need:

Connection.Response response = Jsoup.connect("http://www.4icu.org/reviews/index2.htm")
            .userAgent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)")
            .method(Connection.Method.GET)
            .followRedirects(false)
            .execute();

Document doc = response.parse();
Elements cells = doc.select("td.i");
Iterator<Element> iterator = cells.iterator();

while (iterator.hasNext()) {
    Element cell = iterator.next();
    String university = cell.select("a").text();
    String country = cell.nextElementSibling().select("img").attr("alt");
    System.out.printf("country : %s, university : %s %n", country, university);
}
System.out.println(response.headers());
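
If you would rather stay with the JDK classes from the question, the same idea (set the User-Agent, then read the status, headers and body from one and the same connection) can be sketched with HttpURLConnection. Everything named here (CombinedFetch, Result, fetch, startPickyServer) is hypothetical. The demo talks to a small local stand-in server that returns 403 unless the User-Agent starts with "Mozilla"; that rule is an assumption made for the demo, not a statement about how the real site decides.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

class CombinedFetch {
    // Hypothetical holder for everything read from a single connection.
    static final class Result {
        final int status;
        final Map<String, List<String>> headers;
        final String body;

        Result(int status, Map<String, List<String>> headers, String body) {
            this.status = status;
            this.headers = headers;
            this.body = body;
        }
    }

    // Sets the User-Agent, then reads status, headers and body from the SAME connection.
    static Result fetch(String address, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(address).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            int status = conn.getResponseCode();
            // Error responses (4xx/5xx) expose their body through getErrorStream().
            try (InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream()) {
                String body = in == null ? "" : new String(in.readAllBytes(), StandardCharsets.UTF_8);
                return new Result(status, conn.getHeaderFields(), body);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Local stand-in server for the demo (an assumption, not the real site):
    // it answers 403 unless the User-Agent starts with "Mozilla".
    static HttpServer startPickyServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String agent = exchange.getRequestHeaders().getFirst("User-Agent");
                boolean ok = agent != null && agent.startsWith("Mozilla");
                String text = ok ? "<td class=\"i\"><a>Example University</a></td>" : "Forbidden";
                byte[] page = text.getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(ok ? 200 : 403, page.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(page);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startPickyServer();
        try {
            String address = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
            Result plain = fetch(address, "Java");
            Result spoofed = fetch(address, "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
            System.out.println(plain.status + " -> " + plain.body);
            System.out.println(spoofed.status + " -> " + spoofed.body);
            System.out.println(spoofed.headers);
        } finally {
            server.stop(0);
        }
    }
}
```

The body returned by fetch could then be handed to Jsoup.parse(body) to run the same td.i selection as above, while the status and headers remain available from the same Result.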

Upvotes: 1
