user1807556

Reputation: 31

Parsing CDATA from JavaScript

This is my first post and I'm sorry if I'm doing it wrong but here we go:

I've been working on a project that should scrape values from a website. The values are stored in a JavaScript array. I'm using PHP Simple HTML DOM, and it works with normal scripts but not with the ones wrapped in CDATA blocks. Therefore, I'm looking for a way to scrape data from inside a CDATA block. Unfortunately, all the help I could find was for XML files, and I'm scraping from an HTML file.

The JavaScript I'm trying to scrape is as follows:

<script type="text/javascript">
//<![CDATA[
var data = [{"value":8.41,"color":"1C5A0D","text":"17/11"},{"value":9.86,"color":"1C5A0D","text":"18/11"},{"value":7.72,"color":"1C5A0D","text":"19/11"},{"value":9.42,"color":"1C5A0D","text":"20/11"}];
//]]>
</script>

What I need to scrape is the "value" field of each object in var data.

The problem was that I was calling str_replace() on the parsed object rather than on the raw HTML string. Stripping the CDATA markers before parsing works perfectly :-)

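include('simple_html_dom.php');

// Download the page as a plain string
$lines = file_get_contents('http://www.virtualmanager.com/players/7793477-danijel-pavliuk/training');

// Strip the commented-out CDATA markers before parsing
$lines = str_replace("//<![CDATA[","",$lines);
$lines = str_replace("//]]>","",$lines);

// Parse the cleaned string into a DOM object
$html = str_get_html($lines);

// Print the contents of every <script> element
foreach($html->find('script') as $element) {
    echo $element->innertext;
}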

I will provide you with more information if needed.

Upvotes: 3

Views: 5819

Answers (1)

millimoose

Reputation: 39950

A decent HTML parser shouldn't require JavaScript to be wrapped in a CDATA block. If the CDATA markers are throwing the parser off, just remove them from the HTML before parsing, doing something like this:

  1. Download the HTML file into a string using file_get_contents(), or cURL if your host has disabled HTTP support in that function (the allow_url_fopen setting).
  2. Get rid of the //<![CDATA[ and //]]> bits using str_replace().
  3. Parse the HTML from the cleaned string using Simple DOM's str_get_html().
  4. Process the DOM object as before.
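
Step 4 is left open above. As a rough sketch of one way to get from the cleaned markup to the "value" numbers the question asks for, the snippet below runs steps 2 and 4 on an inline copy of the sample script from the question, so it needs neither a network connection nor Simple HTML DOM. The helper name extract_values() and the regular expression are illustrative assumptions, not part of any library, and the approach only works while the array literal happens to be valid JSON, as it is in the sample.

```php
<?php
// Sample fragment copied from the question; in practice this string would
// come from file_get_contents() or cURL (step 1).
$html = <<<'EOT'
<script type="text/javascript">
//<![CDATA[
var data = [{"value":8.41,"color":"1C5A0D","text":"17/11"},{"value":9.86,"color":"1C5A0D","text":"18/11"},{"value":7.72,"color":"1C5A0D","text":"19/11"},{"value":9.42,"color":"1C5A0D","text":"20/11"}];
//]]>
</script>
EOT;

/**
 * Hypothetical helper (not part of Simple HTML DOM): returns the "value"
 * fields of the `var data = [...]` array found in $html, or an empty
 * array if nothing usable is found.
 */
function extract_values(string $html): array
{
    // Step 2: strip the commented-out CDATA markers.
    $clean = str_replace(['//<![CDATA[', '//]]>'], '', $html);

    // Capture the array literal assigned to `data`, up to the closing "];".
    if (!preg_match('/var\s+data\s*=\s*(\[.*?\])\s*;/s', $clean, $m)) {
        return [];
    }

    // Assumption: the literal is valid JSON, so json_decode() can parse it.
    $rows = json_decode($m[1], true);
    if (!is_array($rows)) {
        return [];
    }

    // Keep only the "value" field of each entry.
    return array_column($rows, 'value');
}

print_r(extract_values($html));
```

If the site ever emits something that is not strict JSON (unquoted keys or single-quoted strings, for example), json_decode() returns null and the helper falls back to an empty array, so the result is worth checking before use.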

Upvotes: 2
