Reputation: 2012
Hi, I'm attempting to crawl Google search results, partly for my own learning but also to see whether I can speed up getting access to direct URLs (I'm aware of their API, but I thought I'd try this for now).
It was working fine but it seems to have stopped; it simply returns nothing now. I'm not sure whether it's something I did, but I did have this in a for loop to let the start parameter increase, and I'm wondering whether that may have caused problems.
Is it possible for Google to block an IP from crawling?
Thanks.
$url = "https://www.google.ie/search?q=adrian+de+cleir&start=1&ie=utf-8&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&channel=fflb&gws_rd=cr&ei=D730U7KgGfDT7AbNpoBY#channel=fflb&q=adrian+de+cleir&rls=org.mozilla:en-US:official";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);
# Iterate over all the <h3> result headings
foreach ($dom->getElementsByTagName('h3') as $link) {
    $actual_link = $link->getElementsByTagName('a');
    foreach ($actual_link as $single_link) {
        # Show the <a href>
        echo '<pre>';
        print_r($single_link->getAttribute('href'));
        echo '</pre>';
    }
}
Upvotes: 1
Views: 3129
Reputation: 166813
You can check whether Google has blocked you with the following simple curl command:
curl -sSLA Mozilla "http://www.google.com/search?q=linux" | html2text -width 80
You may install html2text to convert the HTML into plain text.
Normally you should use the Custom Search API provided by Google to avoid any limitations, so you can retrieve search results more easily and have access to different formats (such as XML or JSON).
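For example, here is a minimal sketch of querying the Custom Search JSON API with Python and requests; the API key and search engine ID are placeholders you would need to create in the Google developer console:
import requests

API_KEY = "YOUR_API_KEY"       # placeholder: API key from the Google developer console
CX = "YOUR_SEARCH_ENGINE_ID"   # placeholder: ID of your custom search engine

def google_search(query, num=10):
    # Call the Custom Search JSON API and return the result links
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": query, "num": num},
    )
    response.raise_for_status()
    return [item["link"] for item in response.json().get("items", [])]

print(google_search("adrian de cleir"))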
Upvotes: 0
Reputation: 171
Given below is the program I have written in Python. It is not fully complete yet; right now it only fetches the first page and prints all the href links found in the result.
We could use a set to remove the redundant links from the results (see the small sketch after the program below).
import requests
from bs4 import BeautifulSoup

def search_spider(max_pages, search_string):
    page = 0
    search_string = search_string.replace(' ', '+')
    while page <= max_pages:
        url = 'https://www.google.com/search?num=10000&q=' + search_string + '#q=' + search_string + '&start=' + str(page)
        print("URL to search - " + url)
        source_code = requests.get(url)
        count = 1
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a", {"class": ""}):
            href = link.get('href')
            if not href:
                continue
            input_string = slice_string(href)
            print(input_string)
            count += 1
        page += 10

def slice_string(input_string):
    # Strip the "/url?q=" prefix that Google adds to result links
    if input_string.startswith("/url?q="):
        input_string = input_string[len("/url?q="):]
    # Cut off Google's tracking parameters after the first '&'
    index_c = input_string.find('&')
    if index_c != -1:
        input_string = input_string[:index_c]
    return input_string

search_spider(1, "bangalore cabs")
This program will search Google for "bangalore cabs".
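As mentioned above, redundant links could be dropped with a set; a minimal sketch, reusing the slice_string helper from the program above (the example hrefs are made up):
seen = set()
for href in ("/url?q=http://example.com/&sa=U", "/url?q=http://example.com/&ved=1"):
    link = slice_string(href)
    if link not in seen:   # only print links we have not seen before
        seen.add(link)
        print(link)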
Thanks,
Karan
Upvotes: 1