Dario

Reputation: 145

Query Google search engine?

I am trying to query the Google search engine by date to get the first page of results and then process it. The query I am currently using returns results, but not within the date range I set. If I paste the same query into Google in a browser, the date range works, but not from my PHP script: the script returns only the current (normal) results, as if the date parameter were not set. Part of the code is below. The query I am referring to is shown below and also appears in the $url variable in the snippet.

Query: https://www.google.com/search?q='.$Query.'&source=lnt&tbs=cdr%3A1%2'.$startDate.$EndDate.'&tbm=

$Query= $_POST['Query'];
$Query=str_replace(" ","+",$Query);
if ($_POST['Start_date']==''){
$startday='1';
$startmonth='11';
$startyear='2011';
}
if ($_POST['End_date']==''){
$endday='1';
$endmonth='11';
$endyear='2013';
}
$startDate='Ccd_min%3A'.$startmonth.'%2F'.$startday.'%2F'.$startyear.'.%2';
$EndDate='Ccd_max%3A'.$endmonth.'%2F'.$endday.'%2F'.$endyear.'';

if ($_POST['Query']!=''){
$url  = 'https://www.google.com/search?   
q='.$Query.'&source=lnt&tbs=cdr%3A1%2'.$startDate.$EndDate.'&tbm=';
echo $url .'<p>';
$html = file_get_html($url);
$searchresults=array();
$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj) {
$link   = trim($linkObj->href);

    // if it is not a direct link but url reference found inside it, then extract
    if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&amp;sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
        $link = $matches[1];
    } else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
        continue;
    }
    array_push($searchresults,$link);
}

Upvotes: 1

Views: 718

Answers (2)

Pedro Lobito

Reputation: 98901

Google serves a different HTML structure to clients without JavaScript enabled, and file_get_html($url) is such a client. Temporarily disable JavaScript in Chrome and inspect the page. That way you can be sure you are using the correct div ids, classes, etc. in your script.


Update based on your comments:

Google doesn't allow searching by date range via a direct URL if JavaScript is disabled. However, you can still use Google's daterange operator to find pages that were indexed by Googlebot within the specified date range. The dates must be given as Julian day numbers, with the fractional part omitted, for this operator to work properly.

Example: daterange:2452671-2452671 lisbon

The daterange operator requires at least one proper search term and can be combined with other operators.


gregoriantojd()

To convert a Gregorian date to a Julian day number, you can use the PHP function gregoriantojd( int $month , int $day , int $year ), which is part of the calendar extension, e.g.:

$startDate = gregoriantojd(12, 28, 2011);
//2455924

$endDate = gregoriantojd(12, 28, 2014);
//2457020
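Since the question takes its dates from a form, a small helper can do this conversion from a date string. This is only a sketch: the function name dateToJulianDay and the 'YYYY-MM-DD' input format are assumptions, not something from the question. It uses gregoriantojd() when the calendar extension is loaded and otherwise falls back to counting days from 1970-01-01, which is Julian day 2440588.

```php
<?php
// Hypothetical helper: converts a 'YYYY-MM-DD' string to a Julian day number.
function dateToJulianDay(string $date): int
{
    $utc = new DateTimeZone('UTC');
    // The leading '!' resets all unspecified fields (the time of day) to zero.
    $dt = DateTimeImmutable::createFromFormat('!Y-m-d', $date, $utc);
    if ($dt === false) {
        throw new InvalidArgumentException("Cannot parse date: $date");
    }
    if (function_exists('gregoriantojd')) {
        return gregoriantojd(
            (int) $dt->format('n'),
            (int) $dt->format('j'),
            (int) $dt->format('Y')
        );
    }
    // Fallback without the calendar extension: whole days since the Unix
    // epoch plus the Julian day number of 1970-01-01.
    return intdiv($dt->getTimestamp(), 86400) + 2440588;
}

echo dateToJulianDay('2011-12-28'), PHP_EOL; // 2455924
echo dateToJulianDay('2014-12-28'), PHP_EOL; // 2457020
```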

Your search $url should look like this:

$url = "https://www.google.pt/search?q=lisbon+daterange:2455924-2457020&btnG=Search&num=100&gbv=1";
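If the search terms come from user input, it may be safer to let PHP encode the query string rather than concatenating it by hand. The sketch below assumes a hypothetical helper named buildDateRangeSearchUrl; the parameter names (q, btnG, num, gbv) are the ones used in the URL above, and nothing here guarantees how Google treats them.

```php
<?php
// Hypothetical helper (not part of the original answer): builds the search
// URL with http_build_query() so that the search terms are URL-encoded.
function buildDateRangeSearchUrl(string $terms, int $startJd, int $endJd, int $num = 100): string
{
    $params = [
        'q'    => "$terms daterange:$startJd-$endJd",
        'btnG' => 'Search',
        'num'  => $num,
        'gbv'  => 1,
    ];
    // Pass '&' explicitly so the result does not depend on the
    // arg_separator.output ini setting.
    return 'https://www.google.com/search?' . http_build_query($params, '', '&');
}

echo buildDateRangeSearchUrl('lisbon', 2455924, 2457020), PHP_EOL;
```

Note that http_build_query() encodes the space as + and the colon as %3A, so the q parameter comes out as lisbon+daterange%3A2455924-2457020.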

Final code:

include_once("simple_html_dom.php");

// Julian day numbers for the start and end of the range
$startDate = gregoriantojd(12, 28, 2011); // 2455924
$endDate   = gregoriantojd(12, 28, 2014); // 2457020
$nResults  = "100";
$Query     = "lisbon";

// urlencode() keeps multi-word search terms intact in the URL
$url = "https://www.google.com/search?q=" . urlencode($Query) . "+daterange:$startDate-$endDate&btnG=Search&num=$nResults&gbv=1";

echo $url . '<p>';
$html = file_get_html($url);
if ($html === false) {
    // file_get_html() returns false when the page could not be loaded
    die('Could not load the results page.');
}
$searchresults = array();
$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj) {
    $link = trim($linkObj->href);

    // if it is not a direct link but a URL reference is found inside it, then extract it
    if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&amp;sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
        $link = $matches[1];
    } else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
        continue;
    }
    array_push($searchresults, $link);
}
print_r($searchresults);

/*
Array ( [0] => http://www.cnn.com/2014/01/25/travel/lisbon-coolest-city/ [1] => http://www.tripadvisor.com/Tourism-g189158-Lisbon_Lisbon_District_Central_Portugal-Vacations.html
etc...
*/
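The link-extraction step in the loop relies on a regular expression that expects &amp;sa= right after the target. As an alternative sketch (a different technique from the regex above, not the answer's method), the query string of the href can be parsed with parse_str(), which also URL-decodes the target. The function name extractTargetUrl is hypothetical, and the /url?q=... shape of the redirect link is assumed from the regex in the code above.

```php
<?php
// Hypothetical helper: returns the target URL for a result link, or null
// if the href is neither a direct http(s) link nor a redirect carrying one.
function extractTargetUrl(string $href): ?string
{
    // Attribute values taken from raw HTML may still contain &amp;.
    $href = htmlspecialchars_decode(trim($href));

    // Direct link: use it as it is.
    if (preg_match('/^https?:\/\//', $href)) {
        return $href;
    }

    // Otherwise look for a q parameter in the query string.
    $pos = strpos($href, '?');
    if ($pos === false) {
        return null;
    }
    parse_str(substr($href, $pos + 1), $params);
    $target = $params['q'] ?? null;

    if (is_string($target) && preg_match('/^https?:\/\//', $target)) {
        return $target;
    }
    return null;
}

var_dump(extractTargetUrl('/url?q=http://www.example.com/page&amp;sa=U&amp;ved=0'));
var_dump(extractTargetUrl('/search?q=lisbon'));
```

Inside the foreach loop this would replace the two preg_match branches: call the helper on $linkObj->href and continue when it returns null.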

Upvotes: 1

bittomix

Reputation: 171

You have a line break inside the URL in the code you posted:

$url  = 'https://www.google.com/search?
q='.$Query.'&source=lnt&tbs=cdr%3A1%2'.$startDate.$EndDate.'&tbm=';

A line break is usually represented as an LF character (0x0A, on Unix-like systems) or as CR+LF characters (0x0D 0x0A, on Windows).

Therefore, if you look closely at the URL you request, your script sends a request with a GET parameter named %0D%0Aq (or %0Aq with LF-only line endings) instead of q.

To correct this, put the two lines above on one line, or put the line break outside the string literals (the strings between each pair of single quotes, in your case). For example (the dot at the beginning of the second line makes it harder to overlook the two-line concatenation):

$url  = 'https://www.google.com/search?q=' 
  . $Query . '&source=lnt&tbs=cdr%3A1%2' . $startDate . $EndDate . '&tbm=';
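To see the difference between the two forms, the following minimal sketch checks each string for a CR or LF character. The helper name hasLineBreak and the fixed search term lisbon are only illustrative.

```php
<?php
// Hypothetical helper: reports whether a URL string contains a CR or LF.
function hasLineBreak(string $url): bool
{
    return preg_match('/[\r\n]/', $url) === 1;
}

// A line break inside the quotes becomes part of the string.
$broken = 'https://www.google.com/search?
q=lisbon';

// A line break between two literals is only source formatting.
$fixed = 'https://www.google.com/search?'
    . 'q=lisbon';

var_dump(hasLineBreak($broken)); // bool(true)
var_dump(hasLineBreak($fixed));  // bool(false)

// The encoded query string starts with %0A (or %0D%0A), followed by q%3Dlisbon.
echo urlencode(substr($broken, strpos($broken, '?') + 1)), PHP_EOL;
```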

Upvotes: 0
