Reputation: 383
I have a program that is supposed to fetch an HTML page and extract content from it.
public class University {
public static void main(String[] args) throws Exception {
System.out.println("Started");
URL url = new URL ("http://www.4icu.org/reviews/index2.htm");
URLConnection spoof = url.openConnection();
// Spoof the connection so we look like a web browser
spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
String connect = url.toString();
Document doc = Jsoup.connect(connect).get();
Elements cells = doc.select("td.i");
Iterator<Element> iterator = cells.iterator();
while (iterator.hasNext()) {
Element cell = iterator.next();
String university = cell.select("a").text();
String country = cell.nextElementSibling().select("img").attr("alt");
System.out.printf("country : %s, university : %s %n", country, university);
}
}
}
However, something in the HTTP response seems to be blocking access to the content. So I wrote the following program to print the response headers of the site.
public class Get_Header {
public static void main(String[] args) throws Exception {
URL url = new URL("http://www.4icu.org/reviews/index2.htm");
URLConnection connection = url.openConnection();
Map responseMap = connection.getHeaderFields();
for (Iterator iterator = responseMap.keySet().iterator(); iterator.hasNext();) {
String key = (String) iterator.next();
System.out.println(key + " = ");
List values = (List) responseMap.get(key);
for (int i = 0; i < values.size(); i++) {
Object o = values.get(i);
System.out.println(o + ", ");
}
}
}
}
It returns the following result:
X-Frame-Options =
SAMEORIGIN,
Transfer-Encoding =
chunked,
null =
HTTP/1.1 403 Forbidden,
CF-RAY =
2ca61c7a769b1980-HKG,
Server =
cloudflare-nginx,
Cache-Control =
max-age=10,
Connection =
keep-alive,
Set-Cookie =
__cfduid=d4f8d740e0ae0dd551be15e031359844d1469853403; expires=Sun, 30-Jul-17 04:36:43 GMT; path=/; domain=.4icu.org; HttpOnly,
Expires =
Sat, 30 Jul 2016 04:36:53 GMT,
Date =
Sat, 30 Jul 2016 04:36:43 GMT,
Content-Type =
text/html; charset=UTF-8,
Although I can get the headers, how should I combine the two programs into one complete program?
Thanks in advance.
Upvotes: 1
Views: 891
Reputation: 2747
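The "User-Agent" property that you set on the URLConnection is lost when you convert the URL back to a String: url.toString() returns only the address, and Jsoup.connect() then opens its own connection without that header.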
Setting the user-agent on the JSoup connection seems to work:
import java.util.Iterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class University {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");
        String url = "http://www.4icu.org/reviews/index2.htm";
        // Set the user agent on the Jsoup connection itself, so it is sent with the request
        Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
        Elements cells = doc.select("td.i");
        Iterator<Element> iterator = cells.iterator();
        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");
            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}
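To illustrate why the header goes missing in the question's code, here is a minimal sketch that uses only the JDK and sends nothing over the network. It shows that a request property belongs to the one connection object it was set on, not to the URL. The class name UserAgentScope, the helper userAgentOf and the "Mozilla/5.0" value are made up for this illustration.

```java
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

public class UserAgentScope {
    // Hypothetical helper: reads the User-Agent request property set on this connection object.
    static String userAgentOf(URLConnection connection) {
        return connection.getRequestProperty("User-Agent");
    }

    public static void main(String[] args) throws Exception {
        URL url = URI.create("http://www.4icu.org/reviews/index2.htm").toURL();

        // openConnection() only creates a connection object; no request is sent yet.
        URLConnection spoof = url.openConnection();
        spoof.setRequestProperty("User-Agent", "Mozilla/5.0");

        // A second connection created from the same URL does not inherit the custom value.
        URLConnection fresh = url.openConnection();

        System.out.println("spoof: " + userAgentOf(spoof));
        System.out.println("fresh: " + userAgentOf(fresh));
        // The URL's string form is just the address; it carries no headers.
        System.out.println("url:   " + url);
    }
}
```

Jsoup.connect(String) is in the same position as the second connection here: it only receives the address, so the user agent has to be set on the Jsoup connection.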
Upvotes: 1
Reputation: 6171
You can use the Connection.Response class to fetch the page, display its headers, and then parse it into a Document to extract the text you need:
Connection.Response response = Jsoup.connect("http://www.4icu.org/reviews/index2.htm")
.userAgent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)")
.method(Connection.Method.GET)
.followRedirects(false)
.execute();
Document doc = response.parse();
Elements cells = doc.select("td.i");
Iterator<Element> iterator = cells.iterator();
while (iterator.hasNext()) {
Element cell = iterator.next();
String university = cell.select("a").text();
String country = cell.nextElementSibling().select("img").attr("alt");
System.out.printf("country : %s, university : %s %n", country, university);
}
System.out.println(response.headers());
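The header dump in the question already hints at the cause: the entry under the null key is the status line, HTTP/1.1 403 Forbidden. As a rough sketch using only the JDK, the status code can be pulled out of the map returned by URLConnection.getHeaderFields() like this. The class name StatusLine and the helper statusCodeOf are hypothetical, and the map in main is built by hand from the question's output rather than fetched.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatusLine {
    // Hypothetical helper: returns the numeric status code found in a header map
    // shaped like the result of URLConnection.getHeaderFields(), or -1 if absent.
    static int statusCodeOf(Map<String, List<String>> headers) {
        // The status line is stored under the null key.
        List<String> statusLine = headers.get(null);
        if (statusLine == null || statusLine.isEmpty()) {
            return -1;
        }
        // Expected shape: "HTTP/1.1 403 Forbidden"
        String[] parts = statusLine.get(0).split(" ");
        if (parts.length < 2) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // Hand-built sample taken from the header output in the question.
        Map<String, List<String>> headers = new HashMap<>();
        headers.put(null, Arrays.asList("HTTP/1.1 403 Forbidden"));
        headers.put("Server", Arrays.asList("cloudflare-nginx"));
        System.out.println(statusCodeOf(headers)); // prints 403
    }
}
```

With a live connection, HttpURLConnection.getResponseCode() returns the same number directly. The helper is only useful if you already have the header map in hand.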
Upvotes: 1