Reputation: 1364
So I'm trying to scrape this page: http://www.asx.com.au/asx/statistics/todayAnns.do
It seems that my code can't get the page's full HTML, and it behaves very weirdly.
I've also tried Simple HTML DOM, but nothing works.
$base = "http://www.asx.com.au/asx/statistics/todayAnns.do";
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_URL, $base);
curl_setopt($curl, CURLOPT_REFERER, $base);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$str = curl_exec($curl);
curl_close($curl);
echo htmlspecialchars($str);
The output is mostly JavaScript, and I can't get the actual page content. My goal is to scrape the table in the middle of that page.
Upvotes: 0
Views: 696
Reputation: 529
If you don't need the most recent data then you can use the cached version of the page from Google.
<?php
use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;
require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');
// Create crawler
$crawler = new GeneralCrawler(
    'http://webcache.googleusercontent.com/search?q=cache:http://www.asx.com.au/asx/statistics/todayAnns.do&num=1&strip=0&vwsrc=0'
);
// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//div[@class="page"]//table');
$configuration->setRowXPath('.//tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Headline',
                'xpath' => './/td[3]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Published',
                'xpath' => './/td[1]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Pages',
                'xpath' => './/td[4]',
            ]
        ),
        new \Scraper\Structure\AnchorField(
            [
                'name'               => 'Link',
                'xpath'              => './/td[5]/a',
                'convertRelativeUrl' => false,
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Code',
                'xpath' => './/text()',
            ]
        ),
    ]
);
// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();
print_r($data);
I was able to get the following data using the above code.
Array
(
    [0] => Array
        (
            [Code] => ASX
            [hash] => 6e16c02b10a10baf739c2613bc87f906
        )
    [1] => Array
        (
            [Headline] => Initial Director's Interest Notice
            [Published] => 10:57 AM
            [Pages] => 1
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868833
            [Code] => STO
            [hash] => aa2ea9b1b9b0bc843a4ac41e647319b4
        )
    [2] => Array
        (
            [Headline] => Becoming a substantial holder
            [Published] => 10:53 AM
            [Pages] => 2
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868832
            [Code] => AKG
            [hash] => f8ff8dfde597a0fc68284b8957f38758
        )
    [3] => Array
        (
            [Headline] => LBT Investor Conference Call Business Update
            [Published] => 10:53 AM
            [Pages] => 9
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868831
            [Code] => LBT
            [hash] => cc78f327f2b421f46036de0fce270a6d
        )
...
Disclaimer: I used the https://github.com/rajanrx/php-scrape framework, and I am an author of that library. You can also grab the data with plain cURL and the XPath expressions listed above. I hope this is helpful :)
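For anyone who prefers not to pull in a library, here is a rough sketch of the plain-PHP route using the built-in DOMDocument and DOMXPath classes with the same XPath expressions. The inline HTML is a made-up sample shaped like the rows in the output above, not the real page markup, and the function name extractAnnouncements is just an illustrative choice. In practice you would pass in the string returned by curl_exec instead.

```php
<?php
// Hypothetical sample markup standing in for the fetched page;
// the real page's structure may differ.
$html = <<<'HTML'
<div class="page">
  <table>
    <tr><th>Time</th><th></th><th>Headline</th><th>Pages</th><th>PDF</th></tr>
    <tr>
      <td>10:57 AM</td><td></td><td>Initial Director's Interest Notice</td><td>1</td>
      <td><a href="/asx/statistics/displayAnnouncement.do?display=pdf&amp;idsId=01868833">PDF</a></td>
    </tr>
  </table>
</div>
HTML;

/**
 * Extract one associative array per data row of the announcements table.
 */
function extractAnnouncements(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    // tr[td] skips header rows that only contain <th> cells.
    foreach ($xpath->query('//div[@class="page"]//table//tr[td]') as $tr) {
        $link = $xpath->query('.//td[5]/a', $tr)->item(0);
        $rows[] = [
            'Published' => trim($xpath->evaluate('string(.//td[1])', $tr)),
            'Headline'  => trim($xpath->evaluate('string(.//td[3])', $tr)),
            'Pages'     => trim($xpath->evaluate('string(.//td[4])', $tr)),
            'Link'      => $link instanceof DOMElement ? $link->getAttribute('href') : null,
        ];
    }
    return $rows;
}

print_r(extractAnnouncements($html));
```

This only helps if the table is actually present in the HTML you fetched, which is why the cached copy is used above.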
Upvotes: 1
Reputation: 469
cURL can only load the page's initial markup. The page above uses JavaScript to load its data after the page has loaded, so you might have to use a JavaScript-rendering tool such as PhantomJS or Splash.
This link might help: https://stackoverflow.com/a/20554152/3086531
To fetch the data on the server side, we can drive PhantomJS from PHP: render the page inside PhantomJS, then pull the output into PHP using the exec command.
This article walks through the process step by step: http://shout.setfive.com/2015/03/30/7817/
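As a rough illustration of the exec approach (not the code from the linked article), the sketch below writes a small PhantomJS script to a temporary file, runs it, and returns the rendered HTML. It assumes a phantomjs binary is on the PATH. The 3-second wait is a guess at how long the page's JavaScript needs, and the names buildPhantomCommand and fetchRenderedHtml are made up for this example.

```php
<?php
// PhantomJS script (JavaScript) that opens the URL passed as the first
// argument, waits, then prints the rendered DOM to stdout.
const PHANTOM_SCRIPT = <<<'JS'
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function (status) {
    if (status !== 'success') {
        phantom.exit(1);
        return;
    }
    // Assumed delay so the page's own JavaScript can populate the table.
    setTimeout(function () {
        console.log(page.content);
        phantom.exit(0);
    }, 3000);
});
JS;

/**
 * Build the shell command, escaping every argument.
 */
function buildPhantomCommand(string $binary, string $scriptPath, string $url): string
{
    return escapeshellcmd($binary) . ' ' . escapeshellarg($scriptPath) . ' ' . escapeshellarg($url);
}

/**
 * Render $url in PhantomJS and return the resulting HTML.
 */
function fetchRenderedHtml(string $url, string $binary = 'phantomjs'): string
{
    $scriptPath = sys_get_temp_dir() . '/render_' . uniqid() . '.js';
    file_put_contents($scriptPath, PHANTOM_SCRIPT);
    try {
        exec(buildPhantomCommand($binary, $scriptPath, $url), $outputLines, $exitCode);
    } finally {
        unlink($scriptPath);
    }
    if ($exitCode !== 0) {
        throw new RuntimeException("phantomjs exited with code $exitCode");
    }
    return implode("\n", $outputLines);
}

// Usage (requires PhantomJS to be installed):
// $html = fetchRenderedHtml('http://www.asx.com.au/asx/statistics/todayAnns.do');
```

The returned HTML can then be parsed with DOMXPath or any HTML parser, just like a normal cURL response.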
Upvotes: 0