Scrape HTML Page that redirects to itself using Curl PHP

Question

I'm trying to scrape this page: http://www.asx.com.au/asx/statistics/todayAnns.do

It seems that my code can't get the page's full HTML; it behaves very strangely.

I've also tried Simple HTML DOM, but nothing works.

    $base = "http://www.asx.com.au/asx/statistics/todayAnns.do";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_URL, $base);
    curl_setopt($curl, CURLOPT_REFERER, $base);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $str = curl_exec($curl);
    curl_close($curl);
    echo htmlspecialchars($str);

This shows mostly JavaScript, and I can't get the actual page content. My goal is to scrape the table in the middle of that page.

Rx Seven · Accepted Answer

If you don't need the most recent data then you can use the cached version of the page from Google.

$configuration->setTargetXPath('//div[@class="page"]//table');
$configuration->setRowXPath('.//tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Headline',
                'xpath' => './/td[3]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Published',
                'xpath' => './/td[1]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Pages',
                'xpath' => './/td[4]',
            ]
        ),
        new \Scraper\Structure\AnchorField(
            [
                'name'               => 'Link',
                'xpath'              => './/td[5]/a',
                'convertRelativeUrl' => false,
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Code',
                'xpath' => './/text()',
            ]
        ),
    ]
);

// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();
print_r($data);

I was able to get the following data using the above code.

Array
(
    [0] => Array
        (
            [Code] => ASX
            [hash] => 6e16c02b10a10baf739c2613bc87f906
        )

    [1] => Array
        (
            [Headline] => Initial Director's Interest Notice
            [Published] => 10:57 AM
            [Pages] => 1
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868833
            [Code] => STO
            [hash] => aa2ea9b1b9b0bc843a4ac41e647319b4
        )

    [2] => Array
        (
            [Headline] => Becoming a substantial holder
            [Published] => 10:53 AM
            [Pages] => 2
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868832
            [Code] => AKG
            [hash] => f8ff8dfde597a0fc68284b8957f38758
        )

    [3] => Array
        (
            [Headline] => LBT Investor Conference Call Business Update
            [Published] => 10:53 AM
            [Pages] => 9
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868831
            [Code] => LBT
            [hash] => cc78f327f2b421f46036de0fce270a6d
        )

...

Disclaimer: I used the https://github.com/rajanrx/php-scrape framework, and I am an author of that library. You can also grab the data with plain cURL, using the XPath expressions listed above. I hope this is helpful :)
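For the plain cURL route, a minimal sketch could look like the following. It assumes the HTML you fetch (for example from a cached copy) actually contains the table, and it reuses the XPath expressions from the configuration above with PHP's built-in DOMDocument and DOMXPath. The function names fetchHtml and extractAnnouncementRows are hypothetical helpers made up for this illustration, not part of any library, and the Code field is left out because its './/text()' lookup depends on how the framework picks the node.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: fetch a URL with cURL, following redirects.
// Returns the response body, or null if the request failed.
function fetchHtml(string $url): ?string
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_MAXREDIRS, 5); // avoid endless redirect loops
    $body = curl_exec($curl);

    return is_string($body) ? $body : null;
}

// Hypothetical helper: apply the XPath expressions from the answer
// to an HTML string and return one array per table row.
function extractAnnouncementRows(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Text of the first node matching $expr relative to $row, or ''.
    $text = function (string $expr, DOMNode $row) use ($xpath): string {
        $node = $xpath->query($expr, $row)->item(0);
        return $node === null ? '' : trim($node->textContent);
    };

    $rows = [];
    foreach ($xpath->query('//div[@class="page"]//table') as $table) {
        foreach ($xpath->query('.//tr', $table) as $tr) {
            // Skip header rows that contain no <td> cells.
            if ($xpath->query('./td', $tr)->length === 0) {
                continue;
            }
            $anchor = $xpath->query('.//td[5]/a', $tr)->item(0);
            $rows[] = [
                'Headline'  => $text('.//td[3]', $tr),
                'Published' => $text('.//td[1]', $tr),
                'Pages'     => $text('.//td[4]', $tr),
                'Link'      => $anchor instanceof DOMElement
                    ? $anchor->getAttribute('href')
                    : '',
            ];
        }
    }

    return $rows;
}
```

To try it, pass the result of fetchHtml($url) to extractAnnouncementRows() and print_r() the returned array. If the rows come back empty, the fetched HTML most likely does not contain the table yet, which is the same problem described in the question.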
