Muhambi

Reputation: 3522

Issue with preg_match

I'm making a simple application to pull recipe info from websites like allrecipes.com. I'm using preg_match, but something isn't working.

$geturl = file_get_contents("http://allrecipes.com/Recipe/Brown-Sugar-Smokies/Detail.aspx?src=rotd");
preg_match('#<title>(.*) - Allrecipes.com</title>#', $geturl, $match);
$name = $match[1];
echo $name;

I'm just trying to take the title of the page (minus the " - Allrecipes.com" part) and put it into a variable, but the output is blank.

Upvotes: 1

Views: 90

Answers (3)

Mike

Reputation: 11

You should get the whole title first, then strip the suffix in PHP, like so:

<?php

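$raw_html = file_get_contents('http://www.allrecipes.com');
if ($raw_html === false || $raw_html === '') {
    throw new \RuntimeException('Fetch failed or returned empty');
}

$matches = array();
// preg_match returns 1 on a match, 0 on no match and false on error,
// so anything other than 1 means $matches[1] is not available.
// .*? stops at the first </title> instead of the last one on the page.
if (preg_match('/<title>(.*?)<\/title>/s', $raw_html, $matches) !== 1) {
    throw new \RuntimeException('Regex error or no <title> found');
}

// trim() removes the whitespace and newlines around the title text
$title = trim($matches[1]);

// you should strip your title here
echo $title;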
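
The stripping step left as a comment above could be sketched as follows. This is only an illustration: `strip_title_suffix` is a hypothetical helper name, the default suffix is taken from the question, and the sample title is made up rather than fetched from the real page.

```php
<?php

/**
 * Hypothetical helper: remove a trailing site suffix from a page title.
 */
function strip_title_suffix(string $title, string $suffix = ' - Allrecipes.com'): string
{
    $title = trim($title);
    $len = strlen($suffix);

    // Only strip when the title really ends with the suffix
    if ($len > 0 && substr($title, -$len) === $suffix) {
        $title = substr($title, 0, -$len);
    }

    return rtrim($title);
}

// Made-up sample title with padding around the text
echo strip_title_suffix("\n\tBrown Sugar Smokies Recipe - Allrecipes.com\n"), "\n";
// prints "Brown Sugar Smokies Recipe"
```

Comparing the end of the string explicitly, rather than calling str_replace, leaves a title untouched when the suffix appears somewhere other than the end.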

Upvotes: 1

raina77ow

Reputation: 106483

There were two problems with this pattern. First, there was a newline after <title> that . did not match (without the s modifier, . means 'any character except a newline'). Second, the Allrecipes.com text was NOT immediately followed by </title>; a newline separated them.

Since \s matches both ordinary whitespace and line breaks, you can just alter your regex like this:

'#<title>\s*(.*?) - Allrecipes.com\s*</title>#s'

The s modifier is not strictly needed here (kudos to minitech for noticing that): the title in this recipe fits on one line, and every "\n" is covered by the \s* subexpressions. I'd still suggest keeping it, so that multi-line titles won't catch you off guard.

I've also replaced .* with .*? for efficiency's sake: since the string you're looking for is quite short, a non-greedy quantifier makes sense here.
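
To see the difference between the two patterns without hitting the network, here is a minimal sketch run against a made-up HTML string that mimics the padded title described above. The function name `extract_recipe_title` and the sample markup are assumptions for illustration, not taken from the real page.

```php
<?php

/**
 * Hypothetical helper: return the captured title, or null when the
 * pattern does not match (or preg_match reports an error).
 */
function extract_recipe_title(string $pattern, string $html): ?string
{
    if (preg_match($pattern, $html, $match) === 1) {
        return $match[1];
    }
    return null;
}

// Made-up sample: newline and tab after <title>, newline before </title>
$html = "<html><head><title>\n\tBrown Sugar Smokies Recipe - Allrecipes.com\n</title></head></html>";

$original = '#<title>(.*) - Allrecipes.com</title>#';
$fixed    = '#<title>\s*(.*?) - Allrecipes.com\s*</title>#s';

// The original pattern fails: . cannot cross the newline after <title>
var_dump(extract_recipe_title($original, $html)); // NULL

// The fixed pattern absorbs the padding with \s*
var_dump(extract_recipe_title($fixed, $html)); // string(26) "Brown Sugar Smokies Recipe"
```

Checking the return value of preg_match against 1 before reading `$match[1]` also avoids the undefined-index notice that the code in the question would raise on a failed match.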

Upvotes: 2

Ry-

Reputation: 225291

If you look at the source of the page, you'll notice that <title> contains some padding around the actual text, for which you need to compensate.

'#<title>\s*(.*) - Allrecipes.com\s*</title>#'

Upvotes: 3
