Reputation: 853
I tried to use Jsoup to crawl this page:
Document dok = Jsoup.connect("http://bola.kompas.com/ligaindonesia").userAgent("Mozilla/5.0").timeout(0).get();
but I got an error like this:
java.io.IOException: Too many redirects occurred trying to load URL http://m.kompas.com/bola
And when I try the URL from that error message instead:
Document dok = Jsoup.connect("http://m.kompas.com/bola").userAgent("Mozilla/5.0").timeout(0).get();
I get this error:
java.io.IOException: Too many redirects occurred trying to load URL http://bola.kompas.com
This is my full code:
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class MainBackup {
    public static void main(String[] args) throws IOException {
        processCrawling_kompas("http://bola.kompas.com/ligaindonesia");
    }

    public static void processCrawling_kompas(String URL) {
        try {
            Connection.Response response = Jsoup.connect(URL).timeout(0).execute();
            int statusCode = response.statusCode();
            if (statusCode == 200) {
                Document dok = Jsoup.connect(URL).userAgent("Mozilla/5.0").timeout(0).get();
                System.out.println("opened page: " + URL);
                Elements nextPages = dok.select("a");
                for (Element nextPage : nextPages) {
                    if (nextPage != null) {
                        if (nextPage.attr("href").contains("bola.kompas.com")) {
                            processCrawling_kompas(nextPage.attr("abs:href"));
                        }
                    }
                }
            }
        } catch (NullPointerException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (HttpStatusException e) {
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
What exactly is happening here, and how can I solve it?
Thanks in advance for your help :)
Upvotes: 1
Views: 1224
Reputation: 3457
Change the first line of your processCrawling_kompas method to this:
Connection.Response response = Jsoup.connect(URL).userAgent("Mozilla/5.0").timeout(0).execute();
The only change is adding a user agent to the first request. With this code I was able to get the following output:
opened page: https://login.kompas.com/act.php?do=ForgotPasswd&skin=default&sr=mykompas&done=http....
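Note that the page opened above is on login.kompas.com, not bola.kompas.com. A likely reason (an assumption on my part, since the URL is cut off) is that the `contains("bola.kompas.com")` check in the question also matches links that only mention that host in their query string. Here is a minimal sketch of a stricter check that compares the host itself, using only `java.net.URI`. The class name `LinkFilter`, the method `isOnHost`, and the sample login URL are made up for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;

class LinkFilter {
    // Returns true only when the host part of the URL equals expectedHost.
    // A host name that merely appears in the path or query does not count.
    static boolean isOnHost(String absoluteUrl, String expectedHost) {
        try {
            String host = new URI(absoluteUrl).getHost();
            return host != null && host.equalsIgnoreCase(expectedHost);
        } catch (URISyntaxException e) {
            return false; // malformed links are simply skipped
        }
    }

    public static void main(String[] args) {
        // Hypothetical URLs, for illustration only.
        String login = "https://login.kompas.com/act.php?done=http://bola.kompas.com/";
        String article = "http://bola.kompas.com/ligaindonesia";
        System.out.println(login + " -> " + isOnHost(login, "bola.kompas.com"));
        System.out.println(article + " -> " + isOnHost(article, "bola.kompas.com"));
    }
}
```

In the crawler you would then test `LinkFilter.isOnHost(nextPage.attr("abs:href"), "bola.kompas.com")` instead of calling `contains` on the raw `href`.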
Upvotes: 1
Reputation: 11712
Providing a userAgent is the right idea. If you also set it on the first Jsoup call, it will work as expected.
Connection.Response response = Jsoup.connect(URL)
        .userAgent("Mozilla/5.0")
        .timeout(0).execute();
By the way, the response object already contains the full HTML, so you do not need to call connect again to get the document. Try this:
String URL = "http://bola.kompas.com/ligaindonesia";
Connection.Response response = Jsoup.connect(URL)
        .userAgent("Mozilla/5.0")
        .timeout(0).execute();
int statusCode = response.statusCode();
if (statusCode == 200) {
    Document dok = Jsoup.parse(response.body(), URL);
    System.out.println("opened page: " + URL);
    // your stuff
}
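One more thing to watch for once the redirect error is gone: the recursive crawler in the question never records which pages it has already opened, so pages that link to each other can make it recurse without end. Below is a rough sketch of a queue-based loop with a visited set. It is not Jsoup-specific: `fetchLinks` is a hypothetical stand-in for "open the page and return its absolute links" (in your case it would wrap the Jsoup call above), and `VisitedCrawl`, `crawl`, and `maxPages` are names I made up.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class VisitedCrawl {
    // Visits each URL at most once, breadth-first, and stops after maxPages pages.
    // fetchLinks is an assumed helper: given a URL, it returns the links found on that page.
    static Set<String> crawl(String start,
                             Function<String, List<String>> fetchLinks,
                             int maxPages) {
        Set<String> visited = new LinkedHashSet<>(); // keeps the order pages were opened in
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already opened, skip it
            }
            for (String link : fetchLinks.apply(url)) {
                if (!visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }
}
```

Using a queue instead of recursion also avoids deep call stacks on large sites, and `maxPages` gives you a simple safety limit while testing.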
Upvotes: 4