EchO

Reputation: 1364

Scrape HTML Page that redirects to itself using Curl PHP

I'm trying to scrape this page: http://www.asx.com.au/asx/statistics/todayAnns.do

My code can't seem to get the page's full HTML; it behaves very strangely.

I've also tried Simple HTML DOM, but nothing works.

    $base = "http://www.asx.com.au/asx/statistics/todayAnns.do";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_URL, $base);
    curl_setopt($curl, CURLOPT_REFERER, $base);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $str = curl_exec($curl);
    curl_close($curl);
    echo htmlspecialchars($str);

The output is mostly JavaScript, and I can't get the actual page content. My goal is to scrape the table in the middle of that page.

Upvotes: 0

Views: 696

Answers (2)

Rx Seven

Reputation: 529

If you don't need the most recent data, you can use Google's cached version of the page.

<?php

use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');

// Create crawler
$crawler = new GeneralCrawler(
    'http://webcache.googleusercontent.com/search?q=cache:http://www.asx.com.au/asx/statistics/todayAnns.do&num=1&strip=0&vwsrc=0'
);

// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//div[@class="page"]//table');
$configuration->setRowXPath('.//tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Headline',
                'xpath' => './/td[3]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Published',
                'xpath' => './/td[1]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Pages',
                'xpath' => './/td[4]',
            ]
        ),
        new \Scraper\Structure\AnchorField(
            [
                'name'               => 'Link',
                'xpath'              => './/td[5]/a',
                'convertRelativeUrl' => false,
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Code',
                'xpath' => './/text()',
            ]
        ),
    ]
);

// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();
print_r($data);

I was able to get the following data using the code above:

Array
(
    [0] => Array
        (
            [Code] => ASX
            [hash] => 6e16c02b10a10baf739c2613bc87f906
        )

    [1] => Array
        (
            [Headline] => Initial Director's Interest Notice
            [Published] => 10:57 AM
            [Pages] => 1
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868833
            [Code] => STO
            [hash] => aa2ea9b1b9b0bc843a4ac41e647319b4
        )

    [2] => Array
        (
            [Headline] => Becoming a substantial holder
            [Published] => 10:53 AM
            [Pages] => 2
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868832
            [Code] => AKG
            [hash] => f8ff8dfde597a0fc68284b8957f38758
        )

    [3] => Array
        (
            [Headline] => LBT Investor Conference Call Business Update
            [Published] => 10:53 AM
            [Pages] => 9
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868831
            [Code] => LBT
            [hash] => cc78f327f2b421f46036de0fce270a6d
        )

...

Disclaimer: I used the https://github.com/rajanrx/php-scrape framework, and I am the author of that library. You can also grab the data with plain cURL, using the XPath expressions listed above. I hope this helps :)
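For the plain cURL route, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath with the same XPath expressions as the configuration above. The function names `fetchHtml` and `extractRows` are hypothetical, the column positions are taken from that configuration, and skipping rows that have no third cell is an assumption about how header rows look, so treat it as a starting point rather than a tested scraper for this page.

```php
<?php
// Hypothetical helper: fetch a URL with cURL, following redirects.
// Returns the response body, or null if the request failed.
function fetchHtml(string $url): ?string
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => 20,
    ]);
    $body = curl_exec($curl);

    return is_string($body) ? $body : null;
}

// Hypothetical helper: pull the announcement rows out of an HTML string.
function extractRows(string $html): array
{
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows  = [];

    foreach ($xpath->query('//div[@class="page"]//table//tr') as $tr) {
        // Evaluate an expression relative to the current row as a string.
        $cell = function (string $expr) use ($xpath, $tr): string {
            return trim($xpath->evaluate("string($expr)", $tr));
        };

        // Assumption: rows without a third <td> are headers, so skip them.
        if ($cell('./td[3]') === '') {
            continue;
        }

        $rows[] = [
            'Published' => $cell('./td[1]'),
            'Headline'  => $cell('./td[3]'),
            'Pages'     => $cell('./td[4]'),
            'Link'      => $cell('./td[5]/a/@href'),
        ];
    }

    return $rows;
}
```

Usage would be along the lines of `print_r(extractRows(fetchHtml($url) ?? ''));`, where `$url` is the cached-page URL from the example above.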

Upvotes: 1

bhar1red

Reputation: 469

cURL can only load the page's markup. This page uses JavaScript to load its data after the page has loaded, so you may have to use PhantomJS or Splash.

This link might help: https://stackoverflow.com/a/20554152/3086531

To fetch the data on the server side, you can drive PhantomJS from PHP: execute the page inside PhantomJS, then pass the output back to PHP using the exec command.

This article walks through the process step by step: http://shout.setfive.com/2015/03/30/7817/
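The PhantomJS-from-PHP approach described above might look roughly like the sketch below. It assumes a `phantomjs` binary is available on the PATH. The names `RENDER_SCRIPT`, `buildRenderCommand` and `renderPage` are hypothetical, and the 2-second wait is a guess at how long the page's own scripts need, not a measured value.

```php
<?php
// PhantomJS script kept in a PHP string. It opens the URL passed as the
// first argument and prints the DOM after the page's scripts have run.
const RENDER_SCRIPT = <<<'JS'
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function (status) {
    if (status !== 'success') {
        phantom.exit(1);
        return;
    }
    // Assumed delay: give the page's scripts time to load the data.
    window.setTimeout(function () {
        console.log(page.content);
        phantom.exit(0);
    }, 2000);
});
JS;

// Build the shell command, escaping both arguments.
function buildRenderCommand(string $scriptPath, string $url): string
{
    return 'phantomjs ' . escapeshellarg($scriptPath) . ' ' . escapeshellarg($url);
}

// Render a page in PhantomJS and return the resulting HTML,
// or null if PhantomJS exited with an error.
function renderPage(string $url): ?string
{
    $scriptPath = tempnam(sys_get_temp_dir(), 'render');
    file_put_contents($scriptPath, RENDER_SCRIPT);

    exec(buildRenderCommand($scriptPath, $url), $output, $exitCode);
    unlink($scriptPath);

    return $exitCode === 0 ? implode("\n", $output) : null;
}
```

The string returned by `renderPage($url)` can then be parsed like any other HTML, for example with DOMDocument and XPath.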

Upvotes: 0
