zigojacko

Reputation: 2063

Need to Fix Scrape PHP Script

We have a PHP script that scrapes search engine results pages and writes our clients' website positions into a bespoke reporting suite for their domains.

Google changed something in the first week of February that stopped our script from detecting the domain on the page. The original developer is not currently in the office, and none of our other staff have been able to resolve it.

I'm pretty sure I know where the issue lies in the script, but as I'm not a developer, I'm unsure what each line is actually doing. The script uses the relevant classes in the search results markup to locate what we're looking for.

The script itself still runs and outputs the HTML fine. Only the part that looks for the 'domain' on the page is failing to detect it.

I appreciate that you're probably going to need a lot more information from me in order to advise what the issue is, and I am happy to provide the file or code as necessary. I would be prepared to pay for a fix too if necessary.

Below is where I feel the issue is occurring:

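// $pos4 holds the result of an earlier strpos() check (not shown).
// The strict === matters: strpos() returns 0 for a match at the first
// character, and 0 == false would wrongly read as "not found".
if ($pos4 === false) {
    // Look for the previous result's domain inside this result's HTML.
    // The @ hides the warning older PHP versions raise when $prevlink is empty.
    $mystring5 = $val[0];
    $findme5 = $prevlink;
    $pos5 = @strpos($mystring5, $findme5);
    if ($pos5 === false) {
        // Not a repeat of the previous domain: count it and print it.
        $serp = $serp + 1;
        echo '<b>'.$serp.'.</b> '.$val[0].'<br /><br />';
        // Extract the URL between href=" and " onmousedown. This only works
        // while the result markup contains exactly that attribute sequence.
        $link = get_string_between($val[1], 'href="', '" onmousedown');
        // Strip the scheme and "www." so the link starts with the bare domain.
        $link = str_replace('https://','',$link);
        $link = str_replace('http://','',$link);
        $link = str_replace('www.','',$link);
        // Remember only the domain part (everything before the first "/")
        // for the repeat check on the next result.
        $prevlink = $link;
        $prevlink = str_replace(strstr($prevlink, '/'), "", $prevlink);
        // Compare the start of the link with the client's domain.
        $sitelen = strlen($row_site_check['website_name']);
        $sitefrom_link = substr($link, 0, $sitelen);
        if ($sitefrom_link == $row_site_check['website_name']) {
            // Match: record the position and link, then connect to the database.
            $site_found = 1;
            $rank_postion = $serp;
            $site_link = $link;
            $con = mysql_connect("localhost","dbname","dbpass");
            if (!$con)
            {
                die('Could not connect: ' . mysql_error());
            }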

Any help would be greatly appreciated.

Thanks.

Upvotes: 1

Views: 443

Answers (1)

John

Reputation: 7826

Check out the Google rank scraper (PHP, open source).

I have been using software based on it daily since it was released, and as far as I can tell there was no change to Google's layout in February that broke anything.

I'm not sure you'll like the answer, but the likely reason is that the rank scraper I linked uses DOM to parse Google's HTML, while you seem to rely on regular expressions and string operations.
I've personally tried to build a scraper on such methods in the past and found that it takes a lot of maintenance work to keep it running, sometimes with really ugly workarounds.
With DOM, small markup changes usually don't break anything, and when they do, adapting the code tends to be easier.
Over the past few years the DOM code of that parser has worked without major interruption; a small change was needed only twice. Google changed a lot on their site in that time, but it didn't cause ill effects.
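
To illustrate the difference, here is a minimal sketch of DOM-based link extraction. It is not code from the linked scraper: the function name extract_result_links, the sample markup, and the assumption that each result title is an <h3> wrapping an <a href> are mine, and Google's real markup differs and changes over time.

```php
<?php
// Hypothetical helper, not taken from the linked scraper.
// Returns the href of every link that sits directly inside an <h3>.
function extract_result_links(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // Assumption: each result title is <h3><a href="...">...</a></h3>.
    foreach ($xpath->query('//h3/a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Made-up sample: one link has an onmousedown attribute, one does not.
$sample = '<html><body>'
    . '<div class="g"><h3 class="r"><a href="https://www.example.com/page"'
    . ' onmousedown="return rwt(this)">Example</a></h3></div>'
    . '<div class="g"><h3 class="r"><a href="http://example.org/">Other</a></h3></div>'
    . '<a href="/preferences">Settings</a>'
    . '</body></html>';

print_r(extract_result_links($sample));
```

Both result links are returned whether or not the onmousedown attribute is present, because the query depends on element structure rather than on the exact text that follows href. A string search for '" onmousedown' would miss the second link.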

The DOM functions of the checker linked above can be found in the functions.php file:

function process_raw($htmdata,$page)
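
However the links are extracted, the domain comparison in the question (the str_replace and substr steps) could also work on the parsed host name instead of the raw string. The following is only a sketch under my own assumptions: normalize_host and is_tracked_site are hypothetical names, and I assume an exact host match is wanted once a leading "www." is ignored.

```php
<?php
// Hypothetical helper: reduce a URL or bare domain to a comparable host name.
function normalize_host(string $url): string
{
    // parse_url() only reports a host when a scheme is present.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    // Strip a single leading "www." only, not every occurrence.
    return preg_replace('/^www\./', '', $host);
}

// Hypothetical helper: true when the result URL belongs to the tracked domain.
// Assumption: only an exact host match counts; subdomains are not matched.
function is_tracked_site(string $resultUrl, string $trackedDomain): bool
{
    $host = normalize_host($resultUrl);
    $tracked = normalize_host($trackedDomain);
    return $host !== '' && $host === $tracked;
}

var_dump(is_tracked_site('https://www.example.com/page', 'example.com'));
var_dump(is_tracked_site('http://example.com.evil.net/', 'example.com'));
```

Comparing whole host names avoids a weakness of prefix matching with substr: a link such as example.com.evil.net also starts with example.com, so a prefix check would count it as a hit.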

Upvotes: 1
