lethalMango

Reputation: 4491

PHP - Processing a Screen Scraped Page

I have followed previous topics to scrape a webpage successfully using cURL and PHP, and that part works fine. What I need to do now is process some information from the page, which has no identifiable classes or markup that I can easily target. The example markup I have is:

<h3>Building details:</h3>
<p>Disabled ramp access<br />
  Male, female and disabled toilets available</p>
  <br/>
  <p><strong>Appointment lead times:</strong></p>
  <p><strong>Type 1</strong>:&nbsp; 8 weeks<br />
  <strong>Type 2</strong>:&nbsp;5 weeks<br />
  <strong>Type 3</strong>:&nbsp;3 weeks<br />
  <strong>Type 4</strong>:&nbsp;3 weeks
</p>

What I need is the lead time in weeks for the different types of appointment, mainly Type 1. Sometimes the lead times are unavailable, in which case the page states:

<p><strong>Appointment lead times:</strong></p>
<p><strong>Type 1</strong>:&nbsp; No information available<br />

I have looked at several methods (RegEx, Simple DOM Parser, etc.) but haven't found a solution for what I am trying to achieve.

Many thanks.

Upvotes: 0

Views: 254

Answers (2)

Mr Coder

Reputation: 8186

Use http://php.net/manual/en/book.tidy.php to convert the page into valid XML; you can then easily query it using XPath via SimpleXML: http://www.w3schools.com/php/php_xml_dom.asp
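As a rough illustration of the XPath idea, here is a minimal sketch. Note that it swaps in DOMDocument::loadHTML and DOMXPath rather than Tidy plus SimpleXML, on the assumption that the DOM extension is available and tolerant enough of the page's markup. The helper name leadTimes() and the $html variable are made up for this example; in practice $html would be the cURL response.

```php
<?php
// Sample markup copied from the question.
$html = <<<HTML
<h3>Building details:</h3>
<p><strong>Appointment lead times:</strong></p>
<p><strong>Type 1</strong>:&nbsp; 8 weeks<br />
<strong>Type 2</strong>:&nbsp;5 weeks<br />
<strong>Type 3</strong>:&nbsp;3 weeks<br />
<strong>Type 4</strong>:&nbsp;3 weeks
</p>
HTML;

// leadTimes() is a hypothetical helper, not part of any library.
// Returns e.g. ['Type 1' => 8, ...]; a value is null when no week count is found.
function leadTimes(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them for sloppy HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    // Every <strong> whose text starts with "Type" is followed by a text node with the value.
    foreach ($xpath->query('//p/strong[starts-with(normalize-space(.), "Type")]') as $strong) {
        $next = $strong->nextSibling;
        if ($next === null || $next->nodeType !== XML_TEXT_NODE) {
            continue;
        }
        // Pull the number out of text such as ": 8 weeks"; anything else maps to null.
        $found = preg_match('/(\d+)\s*weeks?/', $next->nodeValue, $m);
        $result[trim($strong->textContent)] = $found ? (int) $m[1] : null;
    }
    return $result;
}

print_r(leadTimes($html));
```

The query relies only on the <strong> labels, so it should keep working if the surrounding paragraphs change, but it will need adjusting if the labels themselves are renamed.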

Upvotes: 1

Chris Baker

Reputation: 50592

When doing this kind of thing, it can get messy. You have to find some point in the markup where you can break it apart reliably. Your sample has one such spot that I can see: Type 1</strong>:&nbsp;. So, I would do this:

$parts = explode('Type 1</strong>:&nbsp;', $text);

Now, the first bit of $parts[1] will have either your timeframe, or the no information message. Let's use the <br /> at the end to chop it:

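if (count($parts) == 2) {
  // Keep only the text before the first <br />
  $parts = explode('<br />', $parts[1]);
  // Drop the unit and surrounding whitespace, leaving "8" or the message
  $parts = trim(str_replace(' weeks', '', $parts[0]));
}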

Now, $parts holds either the "no information" message or the lead time as a numeric string, and is_numeric will tell you which. This is a dirty method, but scraping page data usually is. Be sure to check the result of each step before assuming you're good for the next.
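Putting the steps above together, a minimal sketch might look like this. The function name extractLeadTime(), its $type parameter, and the choice to return null for "no information" are assumptions made for illustration; the extra split on </p> is there because the last entry in the sample has no trailing <br />.

```php
<?php
// extractLeadTime() is a hypothetical helper used for illustration only.
// Returns the lead time in weeks as an int, or null when it cannot be determined.
function extractLeadTime(string $text, string $type = 'Type 1'): ?int
{
    // Split on the marker that precedes the value we want.
    $parts = explode($type . '</strong>:&nbsp;', $text);
    if (count($parts) !== 2) {
        return null; // marker missing, or present more than once
    }

    // Keep only what comes before the next <br /> (or </p> for the last entry).
    $value = explode('<br />', $parts[1])[0];
    $value = explode('</p>', $value)[0];
    $value = trim(str_replace(' weeks', '', $value));

    // "8" is numeric; "No information available" is not.
    return is_numeric($value) ? (int) $value : null;
}

$available = '<p><strong>Type 1</strong>:&nbsp; 8 weeks<br />' . "\n"
           . '<strong>Type 2</strong>:&nbsp;5 weeks<br />';
$unavailable = '<p><strong>Type 1</strong>:&nbsp; No information available<br />';

var_dump(extractLeadTime($available));           // int(8)
var_dump(extractLeadTime($available, 'Type 2')); // int(5)
var_dump(extractLeadTime($unavailable));         // NULL
```

Returning null for every failure case keeps the caller's check to a single comparison, at the cost of not distinguishing "marker not found" from "no information available".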

Upvotes: 1
