Reputation: 31
This is my first post and I'm sorry if I'm doing it wrong but here we go:
I've been working on a project that scrapes values from a website. The values are stored in a JavaScript array. I'm using PHP Simple HTML DOM, and it works with normal scripts but not with ones wrapped in CDATA blocks. I'm therefore looking for a way to scrape data inside a CDATA block. Unfortunately, all the help I could find was for XML files, and I'm scraping an HTML file.
The JavaScript I'm trying to scrape is as follows:
<script type="text/javascript">
//<![CDATA[
var data = [{"value":8.41,"color":"1C5A0D","text":"17/11"},{"value":9.86,"color":"1C5A0D","text":"18/11"},{"value":7.72,"color":"1C5A0D","text":"19/11"},{"value":9.42,"color":"1C5A0D","text":"20/11"}];
//]]>
</script>
What I need to scrape is the "value" field of each object in var data.
Update: the problem was that I tried to strip the CDATA markers from the parsed object rather than from the raw HTML string. The following code works perfectly :-)
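include('simple_html_dom.php');
// Fetch the raw HTML as a string.
$lines = file_get_contents('http://www.virtualmanager.com/players/7793477-danijel-pavliuk/training');
// Strip the CDATA markers from the string before parsing.
$lines = str_replace("//<![CDATA[", "", $lines);
$lines = str_replace("//]]>", "", $lines);
// Parse the cleaned string.
$html = str_get_html($lines);
// Print the contents of every script element.
foreach ($html->find('script') as $element) {
    echo $element->innertext;
}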
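Once the script text is available, the "value" fields still have to be pulled out of the array. The following is only a sketch of one way to do that: extractDataValues() is a hypothetical helper (not part of Simple HTML DOM), and it assumes the array literal is valid JSON and that no "]" followed by ";" appears inside the data itself.

```php
<?php
// Hypothetical helper: finds the array assigned to `var data` in a script
// body, decodes it as JSON, and returns the "value" field of each entry.
function extractDataValues(string $scriptText): array
{
    // Capture from the opening '[' to the first ']' that is followed by ';'.
    if (!preg_match('/var\s+data\s*=\s*(\[.*?\])\s*;/s', $scriptText, $match)) {
        return [];
    }
    // Decode to associative arrays; bail out if the literal is not valid JSON.
    $rows = json_decode($match[1], true);
    if (!is_array($rows)) {
        return [];
    }
    return array_column($rows, 'value');
}

// Shortened sample taken from the script shown in the question.
$script = <<<'JS'
var data = [{"value":8.41,"color":"1C5A0D","text":"17/11"},{"value":9.86,"color":"1C5A0D","text":"18/11"}];
JS;

print_r(extractDataValues($script));
```

In the loop above, each $element->innertext could be passed to this function, keeping only the results that are not empty.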
I will provide you with more information if needed.
Upvotes: 3
Views: 5819
Reputation: 39950
A decent HTML parser shouldn't require JavaScript to be wrapped in a CDATA block. If the CDATA markers are throwing it off, just remove them from the HTML before parsing, doing something like this:
1. Fetch the page into a string using file_get_contents(), or cURL if your host disabled HTTP support in that function.
2. Remove the //<![CDATA[ and //]]> bits using str_replace().
3. Parse the cleaned string using str_get_html().
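For hosts where file_get_contents() cannot open URLs, the same steps could look roughly like this with cURL. This is a minimal sketch: fetchPage() and stripCdataMarkers() are hypothetical helper names, error handling is reduced to returning an empty string, and the cURL extension is assumed to be installed.

```php
<?php
// Hypothetical helper: download a page with cURL and return its body,
// or an empty string if the request fails.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $body = curl_exec($ch);
    return $body === false ? '' : $body;
}

// Hypothetical helper: remove the commented-out CDATA markers from raw HTML.
function stripCdataMarkers(string $html): string
{
    return str_replace(['//<![CDATA[', '//]]>'], '', $html);
}

// Usage (requires simple_html_dom.php for str_get_html()):
// $html = str_get_html(stripCdataMarkers(fetchPage($url)));
```

Passing both markers to str_replace() as an array removes them in a single call instead of two.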
Upvotes: 2