Redbox

Reputation: 1477

PHP: preg_match_all to extract HTML into a string

I have HTML like this:

<ul id="video-tags">
    <li><em>Tagged: </em></li>
    <li><a href="/tags/sports">sports</a>, </li>
    <li><a href="/tags/entertain">entertain</a>, </li>
    <li><a href="/tags/funny">funny</a>, </li>
    <li><a href="/tags/comedy">comedy</a>, </li>
    <li><a href="/tags/automobile">automobile</a>, </li>
    <li>more <a href="/tags/"><strong>tags</strong></a>.</li>
</ul>

How can I extract sports, entertain, funny, comedy and automobile into a string?

My PHP preg_match_all call looks like this:

preg_match_all('/<a href\="\/tags\/(.*?)\">(.*?)<\/a>, <\/li>/', $this->page, $matches);
echo var_dump($matches);    
echo implode(' ', $tags);  

It does not work.

Upvotes: 4

Views: 7583

Answers (3)

Shiplu Mokaddim

Reputation: 57690

This shorter regex does the same thing; the tag names end up in $matches[1]:

preg_match_all('|tags/[^>]*>([^<]*)|', $str, $matches);
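
One thing to watch with this pattern: the last link in the question's markup (href="/tags/" followed by a <strong> element) also matches, with an empty capture. Below is a minimal sketch, run against a trimmed copy of the question's HTML, that compares the pattern above with a variant using + instead of *. The variant and the variable names $loose and $strict are my own additions for illustration, not part of the original answer.

```php
<?php
// Trimmed copy of the question's markup (assumption: the real input looks like this).
$str = '<ul id="video-tags">'
     . '<li><a href="/tags/sports">sports</a>, </li>'
     . '<li><a href="/tags/funny">funny</a>, </li>'
     . '<li>more <a href="/tags/"><strong>tags</strong></a>.</li>'
     . '</ul>';

// Pattern from the answer: * allows the group to match an empty string,
// which happens for the final link, where <strong> follows the opening tag.
preg_match_all('|tags/[^>]*>([^<]*)|', $str, $loose);

// Variant with +: the group must contain at least one character,
// so the final "more tags" link no longer matches.
preg_match_all('|tags/[^>]*>([^<]+)|', $str, $strict);

var_dump($loose[1]);   // three entries, the last one is an empty string
var_dump($strict[1]);  // two entries: sports, funny
```

If you keep the original pattern, array_filter($matches[1]) would drop the empty entry just as well.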

You can also use DOMDocument:

$d = new DOMDocument();
$d->loadHTML($str);
$as = $d->getElementsByTagName('a');
$result = array();
// Stop one short of the end to skip the final "more tags" link.
for ($i = 0; $i < $as->length - 1; $i++) {
    $result[] = $as->item($i)->textContent;
}

echo implode(' ', $result);

Upvotes: 2

PenguinCoder

Reputation: 4367

I'm not sure where you're getting $this->page from; however, the following should work as you expect:

http://ideone.com/KhWkEg

<?php
$page = 'subject string ...';

preg_match_all('/<a href\="\/tags\/(.*?)\">(.*?)<\/a>, <\/li>/', $page, $matches);

echo implode(', ', $matches[1]);  
?>

Replace the $page variable with your $this->page, as long as it is still a string.

However, I'd suggest not trying to parse HTML with regular expressions. Instead, use a library such as PHP's DOMDocument or Simple HTML DOM to parse the HTML properly.
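
To illustrate the DOM route, here is a minimal sketch using DOMDocument together with DOMXPath. It assumes the page contains the ul#video-tags list from the question; the XPath expression and the variable names are my own choices, not something the question prescribes.

```php
<?php
// Sample markup copied from the question; in real use this would be the fetched page.
$page = <<<HTML
<ul id="video-tags">
  <li><em>Tagged: </em></li>
  <li><a href="/tags/sports">sports</a>, </li>
  <li><a href="/tags/entertain">entertain</a>, </li>
  <li><a href="/tags/funny">funny</a>, </li>
  <li><a href="/tags/comedy">comedy</a>, </li>
  <li><a href="/tags/automobile">automobile</a>, </li>
  <li>more <a href="/tags/"><strong>tags</strong></a>.</li>
</ul>
HTML;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();

// Select every link inside the tag list except the generic "/tags/" one.
$xpath = new DOMXPath($doc);
$links = $xpath->query('//ul[@id="video-tags"]//a[@href != "/tags/"]');

$tags = [];
foreach ($links as $a) {
    $tags[] = trim($a->textContent);
}

echo implode(', ', $tags);
```

Selecting by the list's id rather than by position means the result does not depend on how many other links the page contains.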

Upvotes: 4

user4035

Reputation: 23759

This worked perfectly for me:

preg_match_all('/<a href\="\/tags\/(.*?)\">.*?<\/a>, <\/li>/', $str, $matches);
echo implode(',', $matches[1]);

Prints: sports,entertain,funny,comedy,automobile

$this->page is probably empty; that's why you are not getting any data.

Why do you use two capture groups in the regexp? The same word appears in both the URL and the text of the link.
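
A quick way to check both points is sketched below. It is only a sketch: $page is a hypothetical stand-in for $this->page, and the messages are made up. Note also that the snippet in the question implodes $tags, which is never assigned in the code shown, so here it is filled from $matches[1].

```php
<?php
// Hypothetical stand-in for $this->page; swap in the real page source.
$page = '<li><a href="/tags/sports">sports</a>, </li>'
      . '<li><a href="/tags/comedy">comedy</a>, </li>';

// Rule out the empty-input case first.
if (!is_string($page) || $page === '') {
    exit("Page source is empty, nothing to match.\n");
}

// One capture group is enough: the href and the link text carry the same word.
$count = preg_match_all('/<a href="\/tags\/(.*?)">.*?<\/a>, <\/li>/', $page, $matches);

// preg_match_all returns false on error, otherwise the number of full matches.
if ($count === false) {
    exit("preg_match_all failed with error code " . preg_last_error() . "\n");
}

$tags = $matches[1];  // captured tag names
echo $count . " tags: " . implode(',', $tags) . "\n";
```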

Upvotes: 1
