Reputation: 41
I am using Jsoup to scrape a web page as follows:
public static final String GOOGLE_SEARCH_URL = "https://www.google.com/search";

String searchURL = GOOGLE_SEARCH_URL + "?q=" + searchTerm + "&num=" + num
        + "&start=" + start;
Document doc = Jsoup.connect(searchURL)
        .userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
        // .ignoreHttpErrors(true)
        .maxBodySize(1024*1024*3)
        .followRedirects(true)
        .timeout(100000)
        .ignoreContentType(true)
        .get();
Elements results = doc.select("h3.r > a");
for (Element result : results) {
    String linkHref = result.attr("href");
}
My problem is that the code works well at first, but after a while it stops and always gives me "HTTP error fetching URL. Status=503".
When I add .ignoreHttpErrors(true), it runs without any error, but it does not scrape anything.
*searchTerm is any keyword I want to search for, and num is the number of pages that I need to retrieve.
Could anyone help, please? Does this mean that Google has blocked my IP from scraping? If so, is there any solution, or some other way to scrape Google search results?
Thank you.
Upvotes: 1
Views: 2698
Reputation: 774
A 503 error usually means the website you are trying to scrape is blocking you because it does not want non-human users navigating the site. This is especially true of Google.
There are some things you can do, though (listed below).
Basically, you need to appear as human as possible to prevent sites from blocking you.
EDIT:
I need to warn you that scraping Google search results is against their ToS and might be illegal, depending on where you are.
What you can do
You can use a proxy rotating service to mask your requests so that Google sees them as coming from multiple regions. Search for "proxy rotator service" if you are interested. It might be expensive, depending on what you do with the data.
Then write a module that changes the User-Agent on every request, so that your requests look less suspicious to Google.
Add a random delay after scraping each page. I suggest around 1-5 seconds. A randomized delay makes your requests look more human-like to Google.
Finally, if everything else fails, you might want to look into the Google search API and use it instead of scraping the site.
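The User-Agent rotation and randomized delay suggested above could be sketched roughly as follows. This is only an illustration, not a tested recipe against Google: the class name RequestPacing, its method names, and the example User-Agent strings are all assumptions made for this sketch, and none of it guarantees that requests will not be blocked.

```java
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; these names are not part of Jsoup.
final class RequestPacing {

    // Example User-Agent strings (assumed values; swap in strings from real, current browsers).
    static final List<String> USER_AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                    + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
                    + "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
            "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0");

    private RequestPacing() {
    }

    // Picks one of the User-Agent strings at random.
    static String pickUserAgent(Random rng) {
        return USER_AGENTS.get(rng.nextInt(USER_AGENTS.size()));
    }

    // Returns a delay between minMillis and maxMillis, both inclusive.
    static long randomDelayMillis(Random rng, long minMillis, long maxMillis) {
        if (minMillis < 0 || maxMillis < minMillis) {
            throw new IllegalArgumentException("need 0 <= minMillis <= maxMillis");
        }
        long span = maxMillis - minMillis + 1;
        // nextDouble() is in [0, 1), so the offset is in [0, span - 1].
        return minMillis + (long) (rng.nextDouble() * span);
    }

    // Sleeps for a random 1-5 seconds, the range suggested in the answer.
    static void pause(Random rng) throws InterruptedException {
        Thread.sleep(randomDelayMillis(rng, 1000, 5000));
    }
}
```

In the loop from the question, this would mean passing RequestPacing.pickUserAgent(rng) to Jsoup's .userAgent(...) for each request and calling RequestPacing.pause(rng) after each page has been fetched.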
Upvotes: 1