Reputation: 3522
I'm making a simple application to pull recipe info from websites like allrecipes.com. I'm using preg_match, but something is not working.
$geturl = file_get_contents("http://allrecipes.com/Recipe/Brown-Sugar-Smokies/Detail.aspx?src=rotd");
preg_match('#<title>(.*) - Allrecipes.com</title>#', $geturl, $match);
$name = $match[1];
echo $name;
I'm just trying to take the title of the page (minus the - Allrecipes.com part) and put it into a variable, but all that turns up is blank.
Upvotes: 1
Views: 90
Reputation: 11
You should get the whole title first, then strip it using PHP, like so:
<?php
$raw_html = file_get_contents('http://www.allrecipes.com');
if (empty($raw_html)) {
    throw new \RuntimeException('Fetch empty');
}
$matches = array();
// preg_match() returns 1 on a match, 0 on no match, and false on error,
// so anything other than 1 means $matches[1] is not set.
if (preg_match('/<title>(.*?)<\/title>/s', $raw_html, $matches) !== 1) {
    throw new \RuntimeException('No title found or regex error');
}
$title = trim($matches[1]);
// you should strip your title here
echo $title;
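The stripping step left as a comment above could look something like the following. This is only a sketch: strip_site_suffix is a hypothetical helper name, and the sample title is made up for illustration rather than copied from the real page.

```php
<?php
// Hypothetical helper: removes a trailing site suffix such as
// " - Allrecipes.com" from an already extracted page title.
function strip_site_suffix(string $title, string $suffix = ' - Allrecipes.com'): string
{
    $title = trim($title);
    $len = strlen($suffix);
    // Only cut when the title really ends with the suffix.
    if ($len > 0 && substr($title, -$len) === $suffix) {
        $title = substr($title, 0, -$len);
    }
    return trim($title);
}

// Made-up sample title with the kind of padding the page is said to have.
echo strip_site_suffix("\n\tBrown Sugar Smokies Recipe - Allrecipes.com\n"), "\n";
```

A plain string comparison is used here instead of a second regex so that the dot in Allrecipes.com is matched literally.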
Upvotes: 1
Reputation: 106483
There were two problems in this pattern. First, there was a newline after <title> which wasn't matched by . (without the /s modifier, . matches any character except a newline). Second, the Allrecipes.com text was NOT immediately followed by the </title> substring; there was a newline separating them.
Since \s matches both ordinary whitespace and line breaks, you can just alter your regex like this:
'#<title>\s*(.*?) - Allrecipes.com\s*</title>#s'
The /s modifier is not actually needed here (kudos to minitech for noticing that), as the title in this recipe is on one line, and all "\n" characters are covered by the \s* subexpressions. But I'd still suggest leaving it there, so that multi-line titles won't catch you off guard.
I've replaced .* with .*? for efficiency's sake: as the string you're looking for is quite short, it makes sense to use a non-greedy quantifier here.
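To see the difference between the pattern from the question and the adjusted one, here is a small sketch that runs both against an inline string. The $html sample is made up to mimic the padding described above; it is an assumption, not the real page source.

```php
<?php
// Made-up sample: newline and tab after <title>, newline before </title>.
$html = "<html><head><title>\n\tBrown Sugar Smokies Recipe - Allrecipes.com\n</title></head></html>";

$original = '#<title>(.*) - Allrecipes.com</title>#';
$fixed    = '#<title>\s*(.*?) - Allrecipes.com\s*</title>#s';

// Without /s, . cannot cross the newline right after <title>, so nothing matches.
var_dump(preg_match($original, $html, $m)); // int(0)

// \s* absorbs the whitespace on both sides of the title text.
if (preg_match($fixed, $html, $m) === 1) {
    echo $m[1], "\n"; // Brown Sugar Smokies Recipe
}
```

Checking the return value before reading $m[1] avoids the blank output (and the undefined index notice) from the original code when the pattern does not match.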
Upvotes: 2
Reputation: 225291
If you look at the source of the page, you'll notice that <title> contains some whitespace padding around the actual text, for which you need to compensate.
'#<title>\s*(.*) - Allrecipes.com\s*</title>#'
Upvotes: 3