Reputation: 48443
I found some related threads here on SO, but none of them is exactly what I am looking for.
Here is a sample of text that I have:
Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook
The desired output:
2012-12-13
Peter Novak
books,cinema,facebook
I need to save this information into our database, but I don't know how to detect the label between the <b> tags (e.g. Date) and then the value that immediately follows it (in this case 2012-12-13).
I would be grateful for any help with this, thank you!
Upvotes: 0
Views: 120
Reputation: 146191
Using PHP Simple HTML DOM Parser, you can achieve this easily (it works much like jQuery):
include('simple_html_dom.php');
$html = str_get_html('Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook');
Or
$html = file_get_html('http://your_page.com/');
then
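foreach ($html->find('text') as $t) {
    // Text nodes that follow a <b> label start with a colon, e.g. ": 2012-12-13"
    if (substr($t, 0, 1) == ':') {
        // do whatever you want; here the colon and surrounding whitespace are stripped
        echo trim(substr($t, 1)) . '<br />';
    }
}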
The output of the example is given below
2012-12-13
Peter Novak
books,cinema,facebook
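If adding a third-party parser is not an option, the same idea can be sketched with PHP's built-in DOM extension instead of Simple HTML DOM. This is only a sketch, assuming each value is a text node that directly follows its <b> label; the wrapping <div> and the $fields array are illustrative choices, not part of the answer above.

```php
<?php
$text = 'Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook';

// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Wrap the fragment in a <div> so it is parsed as ordinary body content (assumption)
$doc->loadHTML('<div>' . $text . '</div>');

$fields = [];
foreach ($doc->getElementsByTagName('b') as $b) {
    $next = $b->nextSibling;
    // Only accept a text node directly after the label, e.g. ": 2012-12-13"
    if ($next !== null && $next->nodeType === XML_TEXT_NODE) {
        $value = trim(ltrim(trim($next->nodeValue), ':'));
        $fields[trim($b->textContent)] = $value;
    }
}

// $fields maps each label to its value, keyed by Date, Name and Hobby
print_r($fields);
```

Keeping the label as the array key makes it straightforward to map each value to a database column afterwards.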
Upvotes: 0
Reputation: 10781
Assuming that the format is consistent, explode can work for you:
<?php
$text = "Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook";
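// Split on ": " so that every piece after the first starts with a value
$tokenized = explode(': ', $text);
// Cut each value off at the next "<br" tag
$tokenized[1] = explode("<br", $tokenized[1]);
$tokenized[2] = explode("<br", $tokenized[2]);
$tokenized[3] = explode("<br", $tokenized[3]);
$date = $tokenized[1][0];
$name = $tokenized[2][0];
$hobby = $tokenized[3][0];
// Print one value per line
echo $date, PHP_EOL;
echo $name, PHP_EOL;
echo $hobby, PHP_EOL;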
?>
Upvotes: 0
Reputation: 27277
Since there's not much DOM to traverse, there's not much a DOM traversal tool can do with this.
This should work:
1) Remove everything before the b tag.
2) Remove the b tags. A DOM traversal tool can do this, but if they are pure text, even a regex can do it, and it can remove the colon and the subsequent whitespace in the same pass: <b\s*>[^<]+</b\s*>:\s*
3) Change sequences of br tags to bare newlines (do you really want to?). The DOM traversal tool can do this, but so can regexes: (?:<br\s*/?>)+
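// 1) Drop everything before the first <b> tag
$html = preg_replace('#^.*?(?=<b\s*>)#s', "", $html);
// 2) Drop each <b>label</b>, its colon, and the whitespace after it
$html = preg_replace('#<b\s*>[^<]+</b\s*>:\s*#', "", $html);
// 3) Collapse runs of <br> tags into a single newline
$html = preg_replace('#(?:<br\s*/?>)+#', "\n", $html);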
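Since the values are meant for a database, it may be more convenient to capture label and value pairs than to delete the markup around them. A minimal sketch along those lines, reusing the b-tag pattern from step 2 with two capture groups; the function name extractFields is hypothetical, and the sketch assumes the values themselves never contain a "<":

```php
<?php
// Hypothetical helper: returns an array of label => value pairs
function extractFields(string $html): array
{
    // Group 1 is the label inside <b>...</b>, group 2 is the text up to the next tag
    preg_match_all('#<b\s*>([^<]+)</b\s*>:\s*([^<]*)#', $html, $matches, PREG_SET_ORDER);

    $fields = [];
    foreach ($matches as $m) {
        $fields[trim($m[1])] = trim($m[2]);
    }
    return $fields;
}

$html = 'Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook';
$fields = extractFields($html);

echo $fields['Date'], PHP_EOL;   // 2012-12-13
echo $fields['Name'], PHP_EOL;   // Peter Novak
echo $fields['Hobby'], PHP_EOL;  // books,cinema,facebook
```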
Upvotes: 1
Reputation: 134
If <b>Date</b>, <b>Name</b>, <b>Hobby</b> and the <br /> tags always appear in exactly this form, I suggest you use strpos() and substr().
For instance, to get the date:
// Get start position, +13 because of "<b>Date</b>: "
$dateStartPos = strpos($yourText, "<b>Date</b>") + 13;
// Get end position, use dateStartPos as offset
$dateEndPos = strpos($yourText, "<br />", $dateStartPos);
// Cut out the date, the length is the end position minus the start position
$date = substr($yourText, $dateStartPos, ($dateEndPos - $dateStartPos));
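The same strpos() and substr() steps can be wrapped in a small function so that they work for any label, not only the date. This is a rough sketch: the name fieldValue is hypothetical, and it assumes each value ends at the next "<br" or at the end of the string. It also checks for a missing label, because strpos() returns false in that case and adding an offset to false would silently give a wrong position.

```php
<?php
// Hypothetical helper: returns the value after "<b>$label</b>:", or null if the label is absent
function fieldValue(string $text, string $label): ?string
{
    $marker = "<b>{$label}</b>:";
    $pos = strpos($text, $marker);
    if ($pos === false) {
        return null; // label not found
    }

    // The value starts right after the marker
    $start = $pos + strlen($marker);
    // The value ends at the next <br tag, or at the end of the text
    $end = strpos($text, '<br', $start);
    $value = ($end === false)
        ? substr($text, $start)
        : substr($text, $start, $end - $start);

    return trim($value);
}

$yourText = 'Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook';

echo fieldValue($yourText, 'Date'), PHP_EOL;
echo fieldValue($yourText, 'Name'), PHP_EOL;
echo fieldValue($yourText, 'Hobby'), PHP_EOL;
```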
Upvotes: 0