Adrian

Reputation: 2012

Crawling Google search results with PHP cURL was working, but seems to have stopped

Hi, I'm attempting to crawl Google search results, partly for my own learning and partly to see whether I can speed up access to direct URLs (I'm aware of their API, but I thought I'd try this for now).

It was working fine, but it seems to have stopped and now simply returns nothing. I'm unsure whether it's something I did, but I can say that I had this in a for loop so the start parameter could increase (a sketch of that loop is shown after the code below), and I'm wondering whether that may have caused problems.

Is it possible that Google can block an IP from crawling?

Thanks..

$url = "https://www.google.ie/search?q=adrian+de+cleir&start=1&ie=utf-8&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&channel=fflb&gws_rd=cr&ei=D730U7KgGfDT7AbNpoBY#channel=fflb&q=adrian+de+cleir&rls=org.mozilla:en-US:official";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);

# Create a DOM parser object
$dom = new DOMDocument();

# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);

# Iterate over all the <h3> tags (each result title is wrapped in one)
foreach ($dom->getElementsByTagName('h3') as $heading) {

    # Grab the <a> tags nested inside each heading
    foreach ($heading->getElementsByTagName('a') as $link) {
        # Show the <a href>
        echo '<pre>';
        print_r($link->getAttribute('href'));
        echo '</pre>';
    }
}
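
For context, the paginated loop mentioned above was roughly along these lines (a minimal sketch, reconstructed for illustration; the step size and upper bound here are assumptions, not the exact original values):

# Sketch of the paginated version; the bounds and step size
# are assumptions, not the original values.
for ($start = 0; $start <= 20; $start += 10) {
    $url = "https://www.google.ie/search?q=adrian+de+cleir&start=" . $start;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    $html = curl_exec($ch);
    curl_close($ch);
    # ... parse $html with DOMDocument exactly as above ...
}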

Upvotes: 1

Views: 3129

Answers (2)

kenorb

Reputation: 166813

You can check whether Google has blocked you with the following simple curl command:

curl -sSLA Mozilla "http://www.google.com/search?q=linux" | html2text -width 80

You may need to install html2text to convert the HTML into plain text.

Normally you should use the Custom Search API provided by Google to avoid such limitations; that way you can retrieve search results more easily, with access to different formats (such as XML or JSON).
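
For illustration, here is a minimal PHP sketch of calling the Custom Search JSON API (YOUR_API_KEY and YOUR_CX are placeholders for credentials you would obtain from the Google developer console):

# Minimal sketch of a Custom Search JSON API request;
# YOUR_API_KEY and YOUR_CX are placeholders, not working credentials.
$query = urlencode('adrian de cleir');
$url = "https://www.googleapis.com/customsearch/v1?key=YOUR_API_KEY&cx=YOUR_CX&q=" . $query;

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$json = curl_exec($ch);
curl_close($ch);

# Each item in the JSON response carries the result title and its direct link
$results = json_decode($json, true);
foreach ($results['items'] as $item) {
    echo $item['link'], "\n";
}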

Upvotes: 0

Ram Narayan

Reputation: 171

Given below is a program I have written in Python, though it is not fully complete. Right now it only fetches the first page and prints all the href links found in the results.

We can use a set to remove the redundant links from the results; the code below does this with a seen set.

import requests
from bs4 import BeautifulSoup


def search_spider(max_pages, search_string):
    page = 0
    search_string = search_string.replace(' ', '+')
    seen = set()  # track links already printed so duplicates are skipped
    while page <= max_pages:
        url = ('https://www.google.com/search?num=10000&q=' + search_string
               + '#q=' + search_string + '&start=' + str(page))
        print("URL to search - " + url)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.find_all("a", {"class": ""}):
            href = link.get('href')
            if not href:
                continue
            input_string = slice_string(href)
            if input_string and input_string not in seen:
                seen.add(input_string)
                print(input_string)
        page += 10


def slice_string(input_string):
    # Strip the leading "/url?q=" redirect prefix that Google adds
    if input_string.startswith('/url?q='):
        input_string = input_string[len('/url?q='):]
    # Keep only the part before the first '&' (tracking parameters)
    index_c = input_string.find('&')
    if index_c != -1:
        input_string = input_string[:index_c]
    return input_string


search_spider(1, "bangalore cabs")

This program will search Google for "bangalore cabs".

Thanks,
Karan

Upvotes: 1
