Algernop K.
Algernop K.

Reputation: 477

preg_match misses some ids while fetching data with cURL

For learning purposes, I'm trying to fetch data from the Steam Store, where if the image game_header_image_full exists, I've reached a game. Both alternatives are sort of working, but there's a catch. One is really slow, and the other seems to miss some data and therefore not writing the URL's to a text file.

For some reason, Simple HTML DOM managed to catch 9 URL's, whilst the 2nd one (cURL) only caught 8 URL's with preg_match.

Question 1.

Is $reg formatted in a way that $html->find('img.game_header_image_full') would catch, but not my preg_match? Or is the problem something else?

Question 2.

Am I doing things correctly here? Planning to go for the cURL alternative, but can I make it faster somehow?

Simple HTML DOM Parser (Time to search 100 ids: 1 min, 39s. Returned: 9 URL.)

<?php
    include('simple_html_dom.php');

    $i = 0;
    $times_to_run = 100;
    set_time_limit(0);

    while ($i++ < $times_to_run) {
        // Find target image
        $url = "http://store.steampowered.com/app/".$i;
        $html = file_get_html($url);
        $element = $html->find('img.game_header_image_full');

        if($i == $times_to_run) {
            echo "Success!";
        }

        foreach($element as $key => $value){
        // Check if image was found
            if (strpos($value,'img') == false) {
                // Do nothing, repeat loop with $i++;

            } else {
                // Add (don't overwrite) to file steam.txt
                file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
            }
        }
    }
?>

vs. the cURL alternative.. (Time to search 100 ids: 34s. Returned: 8 URL.)

<?php

    $i = 0;
    $times_to_run = 100;
    set_time_limit(0);

    while ($i++ < $times_to_run) {

        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_URL, 'http://store.steampowered.com/app/'.$i);
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);

        $url = "http://store.steampowered.com/app/".$i;

        $reg = "/<\\s*img\\s+[^>]*class=['\"][^'\"]*game_header_image_full[^'\"]*['\"]/i";

        if(preg_match($reg, $content)) {
            file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
        }

    }

?>

Upvotes: 1

Views: 83

Answers (1)

Alex
Alex

Reputation: 14618

Well you shouldn't use regex with HTML. It mostly works, but when it doesn't, you have to go through hundreds of pages and figuring out which one is the failing one, and why, and correct the regex, then hope and pray that in the future nothing like that will ever happen again. Spoiler alert: it will.

Long story short, read this funny answer: RegEx match open tags except XHTML self-contained tags

Don't use regex to parse HTML. Use HTML parsers, which are complicated algorithms that don't use regex, and are reliable (as long as the HTML is valid). You are using one already, in the first example. Yes, it's slow, because it does more than just searching for a string within a document. But it's reliable. You can also play with other implementations, especially the native ones, like http://php.net/manual/en/domdocument.loadhtml.php

Upvotes: 1

Related Questions