user984621

Reputation: 48443

How to parse this kind of HTML code using PHP?

First of all, I found some threads here on SO, for example here, but they are not exactly what I am looking for.

Here is a sample of text that I have:

Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook

The desired output:

2012-12-13
Peter Novak
books,cinema,facebook

I need to save this information into our database, but I don't know how to detect the label between the <b> tags (e.g. Date) and then the value that immediately follows it (in this case: 2012-12-13).

I would be grateful for any help with this, thank you!

Upvotes: 0

Views: 120

Answers (4)

The Alpha

Reputation: 146191

Using PHP Simple HTML DOM Parser you can achieve this easily (just like jQuery):

include('simple_html_dom.php');
$html = str_get_html('Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook');

Or

$html = file_get_html('http://your_page.com/');

then

foreach($html->find('text') as $t){
    if(substr($t, 0, 1)==':')
    {
        // do whatever you want
        echo substr($t, 1).'<br />';
    }
}

The output of the example is given below:

2012-12-13
Peter Novak
books,cinema,facebook
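
If adding a third-party library is not an option, roughly the same idea can be sketched with PHP's built-in DOMDocument: find each <b> element and read the text node that follows it. This is only a sketch, assuming the dom extension is enabled and the value always sits directly after the closing </b>; the function name parseLabelledFields is made up for illustration.

```php
<?php
// Hypothetical helper (not part of any library): maps each <b> label
// to the text that directly follows it.
function parseLabelledFields(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally, since the input is only a fragment.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('b') as $b) {
        $next = $b->nextSibling;
        // Assumption: the value is the text node right after </b>, e.g. ": 2012-12-13".
        if ($next !== null && $next->nodeType === XML_TEXT_NODE) {
            $value = rtrim(ltrim($next->nodeValue, ": \t\r\n"));
            $fields[trim($b->textContent)] = $value;
        }
    }
    return $fields;
}

$fields = parseLabelledFields('Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook');
print_r($fields);
```

Keying the result by label (Date, Name, Hobby) should make it easier to map each value to a database column than relying on the order of the text nodes.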

Upvotes: 0

hafichuk

Reputation: 10781

Assuming the format is consistent, explode() can work for you:

<?php
$text = "Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook";
$tokenized = explode(': ', $text);
$tokenized[1] = explode("<br", $tokenized[1]);
$tokenized[2] = explode("<br", $tokenized[2]);
$tokenized[3] = explode("<br", $tokenized[3]);

$date = $tokenized[1][0];
$name = $tokenized[2][0];
$hobby = $tokenized[3][0];

// Print one value per line, as in the desired output
echo $date, PHP_EOL;
echo $name, PHP_EOL;
echo $hobby, PHP_EOL;

?>

Upvotes: 0

John Dvorak

Reputation: 27277

Since there's not much DOM to traverse, there's not much a DOM traversal tool can do with this.

This should work:

1) Remove everything before the b tag.

2) Remove the b tags. A DOM traversal tool can do this, but if they are pure text, even a regex can do it, and it can remove the colon and the subsequent whitespace in the same pass: <b\s*>[^<]+</b\s*>:\s*

3) Change sequences of br tags to bare newlines (do you really want to?). The DOM traversal tool can do this, but so can regexes: (?:<br\s*/?>)+

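// 1) Drop everything before the first <b> tag (including a leading <br />)
$html = preg_replace('#^.*?(?=<b\s*>)#s', "", $html);
// 2) Drop each "<b>Label</b>: " prefix, leaving only the value
$html = preg_replace('#<b\s*>[^<]+</b\s*>:\s*#', "", $html);
// 3) Collapse each run of <br> tags into a single newline
$html = preg_replace('#(?:<br\s*/?>)+#', "\n", $html);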
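
If you need the labels as well as the values (for example to decide which database column each value goes into), a capturing variant of the same pattern may be more convenient than deleting the tags. A minimal sketch, assuming the "<b>Label</b>: value" shape from the question; the variable names are only illustrative.

```php
<?php
$html = 'Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook';

// Group 1: the label between the b tags; group 2: everything up to the next tag.
preg_match_all('#<b\s*>([^<]+)</b\s*>:\s*([^<]*)#', $html, $matches, PREG_SET_ORDER);

$fields = [];
foreach ($matches as $match) {
    $fields[trim($match[1])] = trim($match[2]);
}

print_r($fields);
```

Note that <b\s*> does not match <br />, so the line breaks are never mistaken for labels.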

Upvotes: 1

Nerbiz

Reputation: 134

If <b>Date</b>, <b>Name</b>, <b>Hobby</b> and the <br /> tags always appear in exactly that form, I suggest you use strpos() and substr().

For instance, to get the date:

// Get start position, +13 because of "<b>Date</b>: "
$dateStartPos = strpos($yourText, "<b>Date</b>") + 13;
// Get end position, use dateStartPos as offset
$dateEndPos = strpos($yourText, "<br />", $dateStartPos);
// Cut out the date, the length is the end position minus the start position
$date = substr($yourText, $dateStartPos, ($dateEndPos - $dateStartPos));
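
The same steps could be wrapped in a small function so they work for any label. One thing to watch for: in the sample, Hobby is the last field and has no <br /> after it, so strpos() returns false for the end position. A rough sketch, where extractField is a hypothetical helper name and the sample string is the one from the question:

```php
<?php
// Hypothetical helper: returns the text after "<b>$label</b>:" up to the
// next "<br", or null when the label is not present at all.
function extractField(string $text, string $label): ?string
{
    $marker = "<b>{$label}</b>:";
    $start = strpos($text, $marker);
    if ($start === false) {
        return null;
    }
    $start += strlen($marker);
    $end = strpos($text, '<br', $start);
    // The last field has no trailing <br />, so read to the end of the string.
    $value = $end === false
        ? substr($text, $start)
        : substr($text, $start, $end - $start);
    return trim($value);
}

$yourText = 'Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook';

echo extractField($yourText, 'Date'), PHP_EOL;
echo extractField($yourText, 'Name'), PHP_EOL;
echo extractField($yourText, 'Hobby'), PHP_EOL;
```

Using strlen($marker) instead of a hard-coded offset avoids having to recount characters for each label.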

Upvotes: 0
