Matt

Reputation: 1133

Regex to parse out title of a post

I'm using cURL to grab a page and I want to parse out the title of the post (the actual text shown on the link, not the title attribute of the <a>).

The HTML is like this:

<li class="topic">
    <a title="Permanent Link to Blog Post" rel="bookmark" href="http://www.website.com/blog-post/">Title of blog post</a>
</li>

I tried using this code:

preg_match('/<\a title=\".*\" rel=\"bookmark\" href=\".*\">.*<\/a>/', $page, $matches);

But it's not working: PHP returns Array ( ), an empty array.

Can anyone supply me with the regex to do this? I've tried online generators, but they go right over my head. Cheers!

Upvotes: 1

Views: 391

Answers (4)

ghostdog74

Reputation: 342303

Here's another way:

$str = <<<A
<li class="topic">
    <a title="Permanent Link to Blog Post" rel="bookmark" href="http://www.website.com/blog-post/">Title of blog post</a>
</li>
A;
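// Split on the closing tag so each chunk ends with one link's text.
$s = explode("</a>", $str);
foreach ($s as $b) {
    if (strpos($b, "<a title") !== false) {
        // Strip everything up to and including the opening <a ...> tag.
        $b = preg_replace("/.*<a title.*>/ms", "", $b);
        print $b;
    }
}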

Output:

$ php test.php
Title of blog post

Upvotes: 0

Cups

Reputation: 6896

$str = '<li class="topic"> <a title="Permanent Link to Blog Post" rel="bookmark" href="http://www.website.com/blog-post/"> Title of blog post</a> </li>';

// trim() drops the whitespace left behind where the tags were.
echo trim(strip_tags($str));

Gives:

Title of blog post

Upvotes: 0

Felix Kling

Reputation: 816312

Add parentheses to your expression:

'/<a title=".*" rel="bookmark" href=".*">(.*)<\/a>/'

Everything between ( ) will be returned in the array.

Edit:

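You also have to remove the stray backslashes: the ones before the quotation marks are unnecessary, and the one in <\a makes the pattern look for a bell character (\a) instead of the letter a.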

Edit2:

Just seen in the documentation for preg_match:

If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.

You should also test your expression with sample text to make sure that it really does what you want to do.
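
As a rough illustration of that kind of test, here is a minimal sketch that runs a capture-group pattern against the sample HTML from the question. The helper name extractLinkText is hypothetical, and the pattern swaps the greedy .* for [^"]* and a lazy (.*?) so a match cannot run past a closing quote or the first </a>; that is a tweak on the expression above, not part of it.

```php
<?php
// Hypothetical helper: returns the link text of the first rel="bookmark" anchor, or null.
function extractLinkText(string $page): ?string
{
    // [^"]* stays inside each attribute value; (.*?) lazily captures the link text.
    $pattern = '/<a title="[^"]*" rel="bookmark" href="[^"]*">(.*?)<\/a>/';
    if (preg_match($pattern, $page, $matches)) {
        return $matches[1]; // text matched by the first parenthesized subpattern
    }
    return null;
}

// Sample markup copied from the question.
$page = '<li class="topic">
    <a title="Permanent Link to Blog Post" rel="bookmark" href="http://www.website.com/blog-post/">Title of blog post</a>
</li>';

echo extractLinkText($page), "\n"; // prints: Title of blog post
```

This still assumes the attributes always appear in the order title, rel, href, as they do in the question's markup.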

Upvotes: 1

Greg Bacon

Reputation: 139431

Assuming you want the attribute, you could use:

if (preg_match('/<a\s+[^>]*?\btitle="(.+?)"/', $page, $matches)) {
    echo $matches[1], "\n";
}

Parsing HTML can be tricky, and regular expressions aren't up to the job in the general case. For simple, sane documents, you can get away with it.

Just be aware that you're driving a screw with a hammer.
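
If the hammer stops being enough, one alternative is to let an HTML parser do the work. The sketch below uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than a regex, and it reads the link text instead of the attribute. The helper name bookmarkTitles is hypothetical, and selecting links by rel="bookmark" is an assumption based on the markup in the question.

```php
<?php
// Hypothetical helper: collects the text of every <a rel="bookmark"> in the page.
function bookmarkTitles(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $titles = [];
    foreach ($xpath->query('//a[@rel="bookmark"]') as $link) {
        $titles[] = trim($link->textContent);
    }
    return $titles;
}

// Sample markup copied from the question.
$page = '<li class="topic">
    <a title="Permanent Link to Blog Post" rel="bookmark" href="http://www.website.com/blog-post/">Title of blog post</a>
</li>';

print_r(bookmarkTitles($page));
```

Unlike the regex, this does not depend on attribute order or on the link sitting on a single line.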

Upvotes: 0
