DavidNg

Reputation: 2836

Java: download web content from Google https

I am trying to download web content from Google over HTTPS, as in the link below.

link to download

In the code below, I first disable certificate validation (for testing purposes only) by trusting all certificates, and then download the page the same way I would over regular HTTP. For some reason, it does not work:

public static void downloadWeb() {
        // Create a new trust manager that trusts all certificates
        TrustManager[] trustAllCerts = new TrustManager[] { new X509TrustManager() {
            public java.security.cert.X509Certificate[] getAcceptedIssuers() {
                return null;
            }

            public void checkClientTrusted(
                    java.security.cert.X509Certificate[] certs, String authType) {
            }

            public void checkServerTrusted(
                    java.security.cert.X509Certificate[] certs, String authType) {
            }
        } };

        // Activate the new trust manager
        try {
            SSLContext sc = SSLContext.getInstance("SSL");
            sc.init(null, trustAllCerts, new java.security.SecureRandom());
            HttpsURLConnection
                    .setDefaultSSLSocketFactory(sc.getSocketFactory());
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }

        // Begin the download as with regular HTTP
        try {
            String wordAddress = "https://www.google.com/webhp?hl=en&tab=ww#hl=en&tbs=dfn:1&sa=X&ei=obxCUKm7Ic3GqAGvoYGIBQ&ved=0CDAQBSgA&q=pronunciation&spell=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&fp=c5bfe0fbd78a3271&biw=1024&bih=759";
            URLConnection yc = new URL(wordAddress).openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    yc.getInputStream()));
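            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                // Print the line that was read, not the URL
                System.out.println(inputLine);
            }
            in.close();
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }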

    }

Upvotes: 1

Views: 560

Answers (1)

gigadot

Reputation: 8969

You need to fake the HTTP headers so that Google thinks the request comes from a web browser. Here is some sample code using HttpClient:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class App1 {

    public static void main(String[] args) throws IOException {
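        HttpClient httpclient = new DefaultHttpClient();
        // Placeholder: substitute the actual Google URL here
        HttpGet httpget = new HttpGet("http://_google_url_");
        // Send a browser-like User-Agent so the request looks like it comes from a browser
        httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0");
        HttpResponse execute = httpclient.execute(httpget);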
        File file = new File("google.html");
        FileOutputStream fout = null;
        try {
            fout = new FileOutputStream(file);
            execute.getEntity().writeTo(fout);
        } finally {
            if (fout != null) {
                fout.close();
            }
        }
    }
}
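
If you would rather not add a dependency, the same idea can probably be applied to the plain java.net classes used in the question. The following is only a minimal sketch: the class name BrowserLikeDownload and its helper methods are made up for illustration, the UTF-8 decoding is an assumption about the response, and I have not verified how Google responds to it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this illustration.
public class BrowserLikeDownload {

    // Same User-Agent string as in the HttpClient example above.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0";

    // Prepares a connection that will send a browser-like User-Agent.
    // openConnection() does not contact the server yet.
    static HttpURLConnection openWithUserAgent(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    // Reads the whole response body as text. UTF-8 is an assumption here.
    static String download(String address) throws IOException {
        HttpURLConnection conn = openWithUserAgent(address);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BrowserLikeDownload <url>");
            return;
        }
        System.out.print(download(args[0]));
    }
}
```

Run it with the URL as the only argument. The header has to be set before getInputStream() is called, because that call is what actually sends the request.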

Warning: I am not responsible if you use this code and violate Google's terms of service.

Upvotes: 1
