Himanshu

Reputation: 1443

How to add proxy support to Jsoup?

I am new to Java, and my first task is to parse about 10,000 URLs and extract some information from them. I am using Jsoup for this and it is working fine.

But now I want to add proxy support to it. The proxies have a username and password too.

Upvotes: 46

Views: 47974

Answers (7)

Stephan

Reputation: 43053

Jsoup 1.9.1 and above: (recommended approach)

// Fetch url with proxy
Document doc = Jsoup //
               .connect("http://www.example.com/") //
               .proxy("127.0.0.1", 8080) // sets a HTTP proxy
               .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2") //
               .header("Content-Language", "en-US") //
               .get();

You can also use the Connection#proxy overload that takes a java.net.Proxy instance (see below).

Before Jsoup 1.9.1: (verbose approach)

// Setup proxy
Proxy proxy = new Proxy(                                      //
        Proxy.Type.HTTP,                                      //
        InetSocketAddress.createUnresolved("127.0.0.1", 8080) //
);

// Fetch url with proxy
Document doc = Jsoup //
               .connect("http://www.example.com/") //
               .proxy(proxy) //
               .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2") //
               .header("Content-Language", "en-US") //
               .get();

Upvotes: 54

Yusuf Ismail Oktay

Reputation: 875

You can set the proxy through JVM system properties:

System.setProperty("http.proxyHost", "192.168.5.1");
System.setProperty("http.proxyPort", "1080");
// Jsoup.connect requires a URL with a protocol
Document doc = Jsoup.connect("http://www.google.com").get();

Upvotes: 70

juzraai

Reputation: 5943

Jsoup has supported proxies since v1.9.1. The Connection class has the following methods:

  • proxy(Proxy p)
  • proxy(String host, int port)

You can use them like this:

Jsoup.connect("...url...").proxy("127.0.0.1", 8080);

If you need authentication, you can use the Authenticator approach mentioned by @Navneet Swaminathan or simply set system properties:

System.setProperty("http.proxyUser", "username");
System.setProperty("http.proxyPassword", "password");

or

System.setProperty("https.proxyUser", "username");
System.setProperty("https.proxyPassword", "password");
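Whether these properties are picked up depends on the HTTP stack: as far as I can tell they are not listed among the standard JDK networking properties, so it may be safer to wire them up explicitly. Below is a minimal sketch, assuming the credentials are stored in http.proxyUser and http.proxyPassword as above. The class name PropertyProxyAuthenticator and its install() method are hypothetical.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

/** Hypothetical helper: serves proxy credentials from the http.proxyUser / http.proxyPassword properties. */
class PropertyProxyAuthenticator extends Authenticator {

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        // Only answer challenges that come from a proxy, never from an origin server.
        if (getRequestorType() != RequestorType.PROXY) {
            return null;
        }
        String user = System.getProperty("http.proxyUser");
        String password = System.getProperty("http.proxyPassword");
        if (user == null || password == null) {
            return null; // nothing configured, so no credentials are offered
        }
        return new PasswordAuthentication(user, password.toCharArray());
    }

    /** Installs this authenticator JVM-wide; call once before the first request. */
    static void install() {
        Authenticator.setDefault(new PropertyProxyAuthenticator());
    }
}
```

Call PropertyProxyAuthenticator.install() once at startup, after setting the two properties. Like any default Authenticator, it applies to the whole JVM.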

Upvotes: 3

Stephan

Reputation: 43053

Try this code instead:

URL url = new URL("http://www.example.com/");
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("127.0.0.1", 8080)); // or whatever your proxy is

HttpURLConnection uc = (HttpURLConnection) url.openConnection(proxy);
uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
uc.setRequestProperty("Content-Language", "en-US");
uc.setRequestMethod("GET");
uc.connect();

// Jsoup.parse needs a charset (null lets Jsoup detect it) and a base URI for resolving relative links
Document doc = Jsoup.parse(uc.getInputStream(), null, url.toString());

Upvotes: 1

bitbyter

Reputation: 910

You might like to add this before running the program

final String authUser = "USERNAME";
final String authPassword = "PASSWORD";

Authenticator.setDefault(
    new Authenticator() {
        @Override
        public PasswordAuthentication getPasswordAuthentication() {
            return new PasswordAuthentication(authUser, authPassword.toCharArray());
        }
    }
);

..

System.setProperty("http.proxyHost", "192.168.5.1");
System.setProperty("http.proxyPort", "1080");
..
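Two details may matter when the target URLs are https://. The http.* properties only apply to plain HTTP requests, while HTTPS requests read https.proxyHost and https.proxyPort. Also, to my knowledge, since JDK 8u111 Basic authentication is disabled by default for HTTPS tunneling through a proxy and is controlled by the jdk.http.auth.tunneling.disabledSchemes property. Below is a minimal sketch covering both, assuming the same proxy serves HTTP and HTTPS. The class name ProxyProperties is hypothetical.

```java
/** Hypothetical helper: configures one proxy for both HTTP and HTTPS requests. */
class ProxyProperties {

    static void configure(String host, int port) {
        String portText = Integer.toString(port);
        // Plain HTTP requests
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", portText);
        // HTTPS requests use their own pair of properties
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", portText);
        // An empty list re-enables Basic auth for HTTPS tunnels (assumption: the proxy uses Basic auth)
        System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "");
    }
}
```

This should run before the first request is made, for example ProxyProperties.configure("192.168.5.1", 1080) at the top of main.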

Upvotes: 6

Alex Shwarc

Reputation: 885

System.setProperty("http.proxyHost", "192.168.5.1");
System.setProperty("http.proxyPort", "1080");
Document doc = Jsoup.connect("http://www.google.com").get();

This is the wrong solution, because scraping is usually multithreaded and we usually need to switch between proxies. System properties are JVM-wide, so this code sets a single proxy for all threads. In that case it is better not to fetch through Jsoup.Connection.
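One way to avoid a JVM-wide proxy is to choose a java.net.Proxy per request. Below is a minimal sketch of a thread-safe round-robin rotator, assuming the proxies are given as "host:port" strings. The class name ProxyRotator and its methods are hypothetical, and the addresses in the usage note are placeholders.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: hands out HTTP proxies in round-robin order, safe to share between threads. */
class ProxyRotator {
    private final List<Proxy> proxies = new ArrayList<>();
    private final AtomicInteger counter = new AtomicInteger();

    ProxyRotator(List<String> hostPorts) {
        if (hostPorts.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        for (String hostPort : hostPorts) {
            int colon = hostPort.lastIndexOf(':');
            String host = hostPort.substring(0, colon);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            // createUnresolved avoids a DNS lookup for the proxy host at construction time
            proxies.add(new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port)));
        }
    }

    /** Returns the next proxy; floorMod keeps the index valid even after the counter overflows. */
    Proxy next() {
        int index = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

Each worker thread can then call rotator.next() and pass the result to url.openConnection(proxy) or, on Jsoup 1.9.1 and above, to the proxy(Proxy) method of Connection.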

Upvotes: 5

Ryan

Reputation: 882

You don't have to fetch the page through Jsoup. Here's my solution, though it may not be the best.

  URL url = new URL("http://www.example.com/");
  Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("127.0.0.1", 8080)); // or whatever your proxy is
  HttpURLConnection uc = (HttpURLConnection) url.openConnection(proxy);

  uc.connect();

  // Read the response body into a string, keeping line breaks (assumes the page is UTF-8)
  StringBuilder tmp = new StringBuilder();
  try (BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = in.readLine()) != null) {
      tmp.append(line).append('\n');
    }
  }

  Document doc = Jsoup.parse(tmp.toString());

And there it is. This gets the source of the html page through a proxy and then parses it with Jsoup.
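When the body is read manually like this, the reader's charset has to match the page, otherwise non-ASCII text can come out garbled. Below is a minimal sketch that picks the charset from the Content-Type response header and falls back to UTF-8. The class name ContentTypeCharset and its method are hypothetical, and the UTF-8 fallback is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: extracts the charset from a Content-Type header value. */
class ContentTypeCharset {

    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String param = part.trim();
                // Case-insensitive match on the "charset=" parameter
                if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = param.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name, use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8; // assumed default when the header gives no usable charset
    }
}
```

The result can be passed as the second argument of InputStreamReader, using uc.getContentType() as the input.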

Upvotes: 40
