Reputation: 1545
I have written a test web crawler class that attempts to search Google, as shown:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.ProtocolException;
import java.net.URL;

public class WebCrawler {
    String query;

    public WebCrawler(String search) {
        query = search;
    }

    public void connect() {
        HttpURLConnection connection = null;
        try {
            // note: the query should really be URL-encoded (URLEncoder.encode)
            // so that multi-word searches survive the trip
            String url = "http://www.google.com/search?q=" + query;
            URL search = new URL(url);
            connection = (HttpURLConnection) search.openConnection();
            connection.setRequestMethod("GET");
            // setDoOutput is only needed when writing a request body (e.g. POST);
            // it is unnecessary for a plain GET
            connection.setDoOutput(true);
            connection.setDoInput(true);
            connection.setUseCaches(false);
            connection.setAllowUserInteraction(false);
            connection.connect();
            BufferedReader read = new BufferedReader(
                    new InputStreamReader(connection.getInputStream()));
            String line;
            while ((line = read.readLine()) != null) {
                System.out.println(line);
            }
            read.close();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (ProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // guard against an NPE when openConnection() never succeeded
            if (connection != null) {
                connection.disconnect();
            }
        }
    }
}
When I run it with the test query "test", though, I get an HTTP 403 response. What am I missing? This is my first time doing any networking in Java.
Upvotes: 0
Views: 274
Reputation: 25974
403 == Forbidden, which makes sense because you're a robot trying to access a part of Google that they don't want robots accessing. Google's robots.txt pretty clearly says you shouldn't be scraping /search; you can check the rule yourself, as in the sketch below.
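If you want to see it with your own eyes, a minimal sketch along these lines fetches Google's public robots.txt with the same plain HttpURLConnection you're already using and prints the rules that mention /search (standard library only, nothing hypothetical beyond the class name):

public class RobotsTxtCheck {
    public static void main(String[] args) throws Exception {
        // robots.txt is publicly readable, unlike /search itself
        java.net.URL robots = new java.net.URL("http://www.google.com/robots.txt");
        java.net.HttpURLConnection connection =
                (java.net.HttpURLConnection) robots.openConnection();
        connection.setRequestMethod("GET");

        java.io.BufferedReader read = new java.io.BufferedReader(
                new java.io.InputStreamReader(connection.getInputStream()));
        String line;
        while ((line = read.readLine()) != null) {
            // print only the rules that mention /search
            if (line.contains("/search")) {
                System.out.println(line);
            }
        }
        read.close();
        connection.disconnect();
    }
}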
Google provides a search API (Custom Search) which allows 100 free queries per day. They provide libraries and examples of how to interface with it in most languages, including Java. Beyond that quota, you have to pay.
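The Custom Search JSON API is just an HTTP endpoint, so you can reuse almost all of your existing code against a sanctioned URL. A minimal sketch, assuming you've created an API key and a search engine ID in the Google API Console (the YOUR_API_KEY and YOUR_CX values below are placeholders you'd fill in yourself):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class ApiSearch {
    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_API_KEY"; // placeholder: key from the Google API Console
        String cx = "YOUR_CX";          // placeholder: your custom search engine ID
        String query = URLEncoder.encode("test", "UTF-8");

        // Custom Search JSON API endpoint; results come back as JSON
        URL search = new URL("https://www.googleapis.com/customsearch/v1"
                + "?key=" + apiKey + "&cx=" + cx + "&q=" + query);

        HttpURLConnection connection = (HttpURLConnection) search.openConnection();
        connection.setRequestMethod("GET");

        BufferedReader read = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));
        String line;
        while ((line = read.readLine()) != null) {
            System.out.println(line); // raw JSON; parse with the JSON library of your choice
        }
        read.close();
        connection.disconnect();
    }
}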
Upvotes: 1