Reputation: 2836
I am trying to download a Google page over HTTPS, using the URL shown in the code below.
For testing purposes, the code first disables certificate validation by installing a trust manager that trusts all certificates, and then downloads the page the same way it would over plain HTTP. For some reason, the download does not succeed:
public static void downloadWeb() {
    // Create a new trust manager that trusts all certificates (testing only)
    TrustManager[] trustAllCerts = new TrustManager[] { new X509TrustManager() {
        public java.security.cert.X509Certificate[] getAcceptedIssuers() {
            return null;
        }
        public void checkClientTrusted(
                java.security.cert.X509Certificate[] certs, String authType) {
        }
        public void checkServerTrusted(
                java.security.cert.X509Certificate[] certs, String authType) {
        }
    } };
    // Activate the new trust manager
    try {
        SSLContext sc = SSLContext.getInstance("SSL");
        sc.init(null, trustAllCerts, new java.security.SecureRandom());
        HttpsURLConnection
                .setDefaultSSLSocketFactory(sc.getSocketFactory());
    } catch (Exception e) {
        // Report the failure instead of silently swallowing it
        e.printStackTrace();
    }
    // Begin the download as with regular HTTP
    try {
        String wordAddress = "https://www.google.com/webhp?hl=en&tab=ww#hl=en&tbs=dfn:1&sa=X&ei=obxCUKm7Ic3GqAGvoYGIBQ&ved=0CDAQBSgA&q=pronunciation&spell=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&fp=c5bfe0fbd78a3271&biw=1024&bih=759";
        URLConnection yc = new URL(wordAddress).openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(
                yc.getInputStream()));
        String inputLine = "";
        while ((inputLine = in.readLine()) != null) {
            // Print the line that was read, not the URL
            System.out.println(inputLine);
        }
        in.close();
    } catch (IOException e) {
        // Report the failure (for example an HTTP error status) instead of hiding it
        e.printStackTrace();
    }
}
Upvotes: 1
Views: 560
Reputation: 8969
You need to fake the HTTP headers so that Google thinks the request is coming from a web browser. Here is some sample code using Apache HttpClient:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class App1 {
    public static void main(String[] args) throws IOException {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpget = new HttpGet("http://_google_url_");
        httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0");
        HttpResponse execute = httpclient.execute(httpget);
        File file = new File("google.html");
        FileOutputStream fout = null;
        try {
            fout = new FileOutputStream(file);
            execute.getEntity().writeTo(fout);
        } finally {
            if (fout != null) {
                fout.close();
            }
        }
    }
}
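If you would rather not add the HttpClient dependency, the same header can probably be set on the plain `URLConnection` from the question. The following is only a minimal sketch under a few assumptions: the class name `BrowserLikeDownload`, the helper names, and the simplified search URL in `main` are made up for illustration, the response is assumed to be UTF-8 text, and whether Google actually accepts the request is not guaranteed. Note also that everything after `#` in a URL is a fragment, which is never sent to the server, so the query parameters in the question's URL that follow the `#` do not reach Google at all.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class BrowserLikeDownload {
    // Browser-like User-Agent string, copied from the HttpClient sample above.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0";

    // Creates a connection object with the User-Agent header set.
    // openConnection() does not contact the server yet; that happens on getInputStream().
    static URLConnection openWithUserAgent(String address, String userAgent) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        return connection;
    }

    // Downloads the response body as text (assumes UTF-8, which may not hold for every page).
    static String download(String address) throws IOException {
        URLConnection connection = openWithUserAgent(address, USER_AGENT);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical URL for illustration: the query is placed before any '#'.
        System.out.println(download("https://www.google.com/search?q=pronunciation"));
    }
}
```

With this approach the trust-all `TrustManager` from the question should not be needed, since the default trust store normally already accepts Google's certificate.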
Warning: I am not responsible if you use this code and violate Google's terms of service.
Upvotes: 1