Anders

Reputation: 37

regexp works when extracting HTML, but not with file_get_contents

This is my code

$file_string = file_get_contents('http://wiki.teamliquid.net/starcraft2/ASUS_ROG_NorthCon_2013');

preg_match_all('/<th.*>.*Organizer.*<a.*>(.*)<\/a>/msi', $file_string, $organizer);
if (empty($organizer[1])) {
    echo "Couldn't get organizer \n";
    $stats['organizer'] = 'ERROR';
}
else {
    $stats['organizer'] = $organizer[1];
}

I'm trying to get the "Organizer" field from the right-hand "League Information" box on http://wiki.teamliquid.net/starcraft2/ASUS_ROG_NorthCon_2013 but it isn't working.

However (and this is what I don't understand), when I copy the HTML from the page and change the $file_string variable to this:

$file_string = '<tr>
<th valign="top"> Organizer:
</th>
<td style="width:55%;"> <a rel="nofollow" target="_blank" class="external text" href="http://www.northcon.de/">NorthCon</a>
</td></tr>';

The regexp works. Perhaps it could be JavaScript or something? However, I'm able to extract the data of pretty much all of the other rows in the same box, using regexp. I swear I'm missing something obvious here, maybe I just need a set of fresh eyes?

Upvotes: 0

Views: 60

Answers (1)

Traxo

Reputation: 19022

This code should work:

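// Fetch the raw HTML of the page.
$file_string = file_get_contents('http://wiki.teamliquid.net/starcraft2/ASUS_ROG_NorthCon_2013');

// [^>]* keeps the match inside the <th> tag, and the lazy .*? stops at the first </a> after "Organizer".
preg_match_all('/<th[^>]*>\s*Organizer.*?<\/a>/si', $file_string, $organizer);
print_r($organizer); // debug output: $organizer[0] is the list of full matches
if (empty($organizer[0])) {
    echo "Couldn't get organizer \n";
    $stats['organizer'] = 'ERROR';
}
else {
    $stats['organizer'] = $organizer[0];
}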

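Use $organizer[0] instead of $organizer[1]: this pattern has no capture group, so $organizer[1] is never set and the full matches end up in $organizer[0]. You also have to make .* lazy by putting a question mark after it (.*?). A lazy quantifier matches as few characters as possible, so it stops at the first place where the rest of the pattern can match instead of running on to the last one.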

For example, this pattern

<a.*>(.*)<\/a>

matches from the first <a to the last </a> on the page (it doesn't stop at the first </a> it finds), while this pattern

<a.*?>(.*?)<\/a>

stops at the first </a>.
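
The difference between the two patterns is easier to see on a small input. Here is a minimal sketch, assuming a made-up two-link string ($html is a stand-in, not the real page):

```php
<?php
// Two links in one string, standing in for a page with many <a> tags.
$html = '<a href="/one">First</a> and <a href="/two">Second</a>';

// Greedy: .* takes as much as it can, so the match runs to the LAST </a>.
preg_match('/<a.*>(.*)<\/a>/s', $html, $greedy);

// Lazy: .*? takes as little as it can, so the match ends at the FIRST </a>.
preg_match('/<a.*?>(.*?)<\/a>/s', $html, $lazy);

echo $greedy[0], "\n"; // the whole string, both links included
echo $greedy[1], "\n"; // Second
echo $lazy[0], "\n";   // <a href="/one">First</a>
echo $lazy[1], "\n";   // First
```

With the s modifier the dot also matches newlines, which is why the greedy version can swallow most of a real page.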

If you echo the result, check the page source to see it, since the match contains HTML tags. This will be the result (I assume you wanted it like this, with the HTML included):

<th valign="top"> Organizer:
</th>
<td style="width:55%;"> <a rel="nofollow" target="_blank" class="external text" href="http://www.northcon.de/">NorthCon</a>
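
If you only want the organizer's name without the surrounding HTML, a capture group around the link text can pull it out. This is a rough sketch that runs against the sample row from the question rather than the live page, so the real markup may need a different pattern; $pattern and $organizerName are names made up for this example:

```php
<?php
// Sample row copied from the question; the live page may differ.
$file_string = '<tr>
<th valign="top"> Organizer:
</th>
<td style="width:55%;"> <a rel="nofollow" target="_blank" class="external text" href="http://www.northcon.de/">NorthCon</a>
</td></tr>';

// [^>]* cannot run past the end of a tag, and the lazy .*? stops at the
// first <a that follows "Organizer". The parentheses capture the link text.
$pattern = '/<th[^>]*>\s*Organizer.*?<a[^>]*>(.*?)<\/a>/si';

if (preg_match($pattern, $file_string, $m)) {
    $organizerName = trim($m[1]); // $m[0] is the full match, $m[1] the captured text
} else {
    $organizerName = 'ERROR';
}

echo $organizerName, "\n"; // NorthCon
```

preg_match is used here instead of preg_match_all because only the first match is needed, which also means $m[1] is a plain string rather than an array.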

Upvotes: 2
