Reputation: 627
I want to write a program that parses an HTML page, selects the useful information, and displays it. I first did this by opening a stream and searching it line by line for the relevant content, but that is time-consuming. I then decided to treat the page as XML and query it with XPath. I saved the contents of the stream to an XML file on my system and loaded that, but got a whitespace error, so I tried parsing the stream directly as
doc = (Document) builder.parse(inputStream);
but the same error persists. When I asked here, it was suggested that I use jsoup for HTML parsing. Now when I execute:
Document doc= Jsoup.connect(url).get();
I get Read timed out. The same program written in Python, using a naive strategy (searching with the string find method), displays the contents, and quickly. How can I make this work quickly in Java?
Complete code:
import java.io.*;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Parser {
    public static void main(String[] args) {
        Validate.isTrue(true, "usage: supply url to fetch");
        try {
            String url = "http://www.spoj.com/ranks/PRIME1/";
            Document doc = Jsoup.connect(url).get();
            Elements es = doc.getElementsByAttributeValue("class", "lightrow");
            System.out.println(es.get(0).child(0).text());
        } catch (Exception e) { e.printStackTrace(); }
    }
}
Exception:
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
    at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at java.net.HttpURLConnection.getResponseCode(Unknown Source)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:412)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:393)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:159)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:148)
    at Parser.main(Parser.java:12)
Upvotes: 0
Views: 660
Reputation: 25350
Does your firewall or OS block the request (maybe it blocks Java's access to the internet)? Are you on a PC or, e.g., Android? And is your HTML page a website or a (local) HTML file? Please post some more code or the exception you get.
Please make sure you don't use a DOM Document (org.w3c.dom.Document) but org.jsoup.nodes.Document.
I am displayed the contents
How do you want to display the content? If you simply need a value like this:
...
<div>some value</div>
...
You can do this with jsoup:
Document doc = ... // parse html file or connect to website
final String value = doc.select("div").first().text();
System.out.println(value);
Since the default timeout is 3 seconds (3000 ms; newer jsoup releases default to 30 seconds), it should be raised for big or slow websites, because loading the data may take longer than that:
final String url = "http://www.spoj.com/ranks/PRIME1/";
final int timeout = 4000; // milliseconds; raise it further if the site is slow
Document doc = Jsoup.connect(url).timeout(timeout).get();
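If you want to see what a read timeout actually is without depending on spoj.com, here is a minimal sketch that uses only the JDK. Your stack trace shows jsoup going through HttpURLConnection, so the sketch sets the read timeout on that class directly. The class name ReadTimeoutDemo and the method timesOutAgainstSilentServer are made up for illustration, and the "silent server" is just a local socket that never answers; this is an assumption-laden demo, not how jsoup is implemented internally.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;

// Hypothetical demo class, not part of jsoup.
final class ReadTimeoutDemo {

    /**
     * Sends a GET to a local socket that never replies and reports
     * whether the request failed with a SocketTimeoutException.
     */
    static boolean timesOutAgainstSilentServer(int readTimeoutMillis) {
        // The OS completes the TCP handshake for a bound ServerSocket even
        // though accept() is never called, so the client connects fine but
        // no response ever arrives.
        try (ServerSocket silent = new ServerSocket(0)) {
            String url = "http://127.0.0.1:" + silent.getLocalPort() + "/";
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(2000);           // time allowed to connect
            conn.setReadTimeout(readTimeoutMillis); // time allowed to wait for data
            try {
                conn.getResponseCode();             // blocks until headers arrive
                return false;
            } catch (SocketTimeoutException e) {
                return true;                        // same exception type as in the question
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + timesOutAgainstSilentServer(300));
    }
}
```

A slow site behaves like a server that answers late rather than never: if the timeout is shorter than the server's response time you get the exception from the question, and a longer timeout gives the response time to arrive.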
Upvotes: 1