Reputation: 145
I am trying to query the Google search engine by date, fetch the first page of results, and then process it. The query I am currently using returns results, but not within the date range I set. If I paste the same query into Google in a browser, the date range works, but from my PHP script it does not: the script returns only the current (normal) results, as if the date parameter were not set. The query is shown below, and it also appears in the $url variable of the code snippet that follows.
Query: https://www.google.com/search?q='.$Query.'&source=lnt&tbs=cdr%3A1%2'.$startDate.$EndDate.'&tbm=
$Query= $_POST['Query'];
$Query=str_replace(" ","+",$Query);
if ($_POST['Start_date']==''){
$startday='1';
$startmonth='11';
$startyear='2011';
}
if ($_POST['End_date']==''){
$endday='1';
$endmonth='11';
$endyear='2013';
}
$startDate='Ccd_min%3A'.$startmonth.'%2F'.$startday.'%2F'.$startyear.'.%2';
$EndDate='Ccd_max%3A'.$endmonth.'%2F'.$endday.'%2F'.$endyear.'';
if ($_POST['Query']!=''){
$url = 'https://www.google.com/search?
q='.$Query.'&source=lnt&tbs=cdr%3A1%2'.$startDate.$EndDate.'&tbm=';
echo $url .'<p>';
$html = file_get_html($url);
$searchresults=array();
$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj) {
$link = trim($linkObj->href);
// if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
array_push($searchresults,$link);
}
Upvotes: 1
Views: 718
Reputation: 98901
Google presents a different HTML structure to clients without JavaScript enabled (such as file_get_html($url)). Temporarily disable JavaScript in Chrome and inspect the page. This way you can be sure to get the correct div ids, classes, etc. to use in your script.
Google doesn't allow searching by date range via direct url if JavaScript is disabled.
However, you can still use the daterange Google operator to find pages that Googlebot indexed within the specified date range. The dates must be given as Julian day numbers, with the fractional part omitted, for this operator to work properly.
Example: daterange:2452671-2452671 lisbon
The daterange operator requires at least one proper search term and can be combined with other operators.
To convert a Gregorian date to a Julian day number, you can use the PHP function gregoriantojd( int $month , int $day , int $year ), i.e.:
$startDate = gregoriantojd(12, 28, 2011);
//2455924
$endDate = gregoriantojd(12, 28, 2014);
//2457020
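The conversion above can be wrapped in a small helper so that dates do not have to be split by hand. This is only a sketch: julianDay() and dateRangeQuery() are hypothetical names, the 'Y-m-d' input format is an assumption, and the fallback branch is only there for installations where the calendar extension (which provides gregoriantojd()) is not loaded.

```php
<?php
// Hypothetical helpers, not part of the original answer.

/**
 * Convert a 'Y-m-d' date string to its Julian day number.
 */
function julianDay(string $date): int
{
    // The leading '!' resets all unspecified fields, so the time is midnight UTC.
    $d = DateTimeImmutable::createFromFormat('!Y-m-d', $date, new DateTimeZone('UTC'));
    if ($d === false) {
        throw new InvalidArgumentException("Expected a Y-m-d date, got: $date");
    }
    if (function_exists('gregoriantojd')) {
        // Provided by PHP's calendar extension; argument order is month, day, year.
        return gregoriantojd((int) $d->format('n'), (int) $d->format('j'), (int) $d->format('Y'));
    }
    // Fallback without the calendar extension: 1970-01-01 is Julian day 2440588.
    return intdiv($d->getTimestamp(), 86400) + 2440588;
}

/**
 * Build the "terms + daterange operator" part of the query string.
 */
function dateRangeQuery(string $terms, string $start, string $end): string
{
    return urlencode($terms) . '+daterange:' . julianDay($start) . '-' . julianDay($end);
}

echo dateRangeQuery('lisbon', '2011-12-28', '2014-12-28'), "\n";
// lisbon+daterange:2455924-2457020
```

The printed value matches the two Julian day numbers computed by hand above, so it can be dropped into the q parameter as is.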
Your search $url should look like this:
$url = "https://www.google.pt/search?q=lisbon+daterange:2455924-2457020&btnG=Search&num=100&gbv=1";
include_once("simple_html_dom.php");
$startDate = gregoriantojd(12, 28, 2011); //2455924
$endDate = gregoriantojd(12, 28, 2014); //2457020
$nResults = "100";
$Query= "lisbon";
$url = "https://www.google.com/search?q=$Query+daterange:$startDate-$endDate&btnG=Search&num=$nResults&gbv=1";
echo $url .'<p>';
$html = file_get_html($url);
$searchresults=array();
$linkObjs = $html->find('h3.r a');
foreach ($linkObjs as $linkObj) {
$link = trim($linkObj->href);
// if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
array_push($searchresults,$link);
}
print_r($searchresults);
/*
Array ( [0] => http://www.cnn.com/2014/01/25/travel/lisbon-coolest-city/ [1] => http://www.tripadvisor.com/Tourism-g189158-Lisbon_Lisbon_District_Central_Portugal-Vacations.html
etc...
*/
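The link-filtering step in the loop can be tried in isolation, without fetching anything from Google. The sketch below wraps the same two regular expressions in a hypothetical extractResultLink() function; the sample href values are made up for illustration and are not real Google output.

```php
<?php
// Hypothetical helper wrapping the filtering logic used in the foreach loop.
function extractResultLink(string $href): ?string
{
    $href = trim($href);
    if (preg_match('/^https?/', $href)) {
        return $href; // already a direct link
    }
    // Redirect-style link: pull the target out of the q= parameter.
    if (preg_match('/q=(.+)&sa=/U', $href, $matches) && preg_match('/^https?/', $matches[1])) {
        return $matches[1];
    }
    return null; // not a usable link, the caller should skip it
}

// Made-up sample hrefs covering the three branches.
$samples = [
    'https://example.org/',
    '/url?q=http://example.com/page&sa=U&ved=abc',
    '/search?q=lisbon&start=10',
];

foreach ($samples as $href) {
    var_dump(extractResultLink($href));
}
```

Keeping this logic in a function makes it easy to adjust when the markup Google returns changes, since only the selector and these patterns need to be revisited.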
Upvotes: 1
Reputation: 171
You have a line break inside the URL in the code you posted:
$url = 'https://www.google.com/search?
q='.$Query.'&source=lnt&tbs=cdr%3A1%2'.$startDate.$EndDate.'&tbm=';
A line break is usually represented as an LF character (0x0A, on Unix-like systems) or as CR+LF characters (0x0D+0x0A, on Windows).
Therefore, if you look closely at the URL you request, your script sends a request with a GET parameter named %0D%0Aq (or %0Aq with Unix line endings) instead of q.
To correct this, either put the two lines above on a single line, or move the line break outside the string literals, which in your case are the strings between each pair of single quotes, e.g. (the dot at the beginning of the second line makes the two-line concatenation harder to overlook):
$url = 'https://www.google.com/search?q='
. $Query . '&source=lnt&tbs=cdr%3A1%2' . $startDate . $EndDate . '&tbm=';
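To see the effect, and as one possible way to avoid hand-concatenating encoded fragments at all, here is a minimal sketch. buildSearchUrl() is a hypothetical helper; the parameter names and the tbs value format are taken from the URL in the question, and whether Google honours them for a scripted request is a separate matter.

```php
<?php
// A line break that ends up in front of "q" becomes part of the parameter name.
echo rawurlencode("\nq"), "\n";   // %0Aq     (Unix line ending)
echo rawurlencode("\r\nq"), "\n"; // %0D%0Aq  (Windows line ending)

// Hypothetical helper: let http_build_query() do the encoding and joining.
function buildSearchUrl(string $terms, string $min, string $max): string
{
    $params = [
        'q'      => $terms,
        'source' => 'lnt',
        'tbs'    => "cdr:1,cd_min:$min,cd_max:$max",
        'tbm'    => '',
    ];
    // Pass the separator explicitly so the result does not depend on php.ini.
    return 'https://www.google.com/search?' . http_build_query($params, '', '&');
}

echo buildSearchUrl('lisbon travel', '11/1/2011', '11/1/2013'), "\n";
```

Because the query string is assembled from an array, there is no string literal that could accidentally span two lines, and spaces, colons, commas, and slashes are encoded automatically.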
Upvotes: 0