Reputation: 13120
So I have many large text paragraphs to parse. The end goal is to separate the paragraphs into smaller postings, so I can insert them into mysql.
Here's a very short example of one of the paragraphs in a string:
<?php
$longstring = '
(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
Lots of text entered here under the first line.<br>And most of it is html, since it is for displaying in a web browser.<br></br></br>
(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Forgot to put one more thing in the notes.........<br>blah blah blah
(<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
';
?>
Yep, I have a freaky project of parsing these strings for each entry. Yes, I agree with anyone that this is not a cool task. the original developer allowed for appending text to the original text. Not a bad idea for some occasions, but for me it is.
I do need help with how to RegEx this beast and place it into a foreach loop so I can start cleaning it up.
Here's how far I got:
<?php
if(preg_match_all('/\(<b>.*?<hr>/', $longstring, $matches)){
print_r($matches);
}
/* output:
Array
(
[0] => Array
(
[0] => (<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
[1] => (<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
[2] => (<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
)
)
*/
?>
So, I'm actually doing pretty good with looping through the tops of each entry. I'm kinda proud I figured that out. (regex is my nemesis)
So now I'm stuck figuring out how to include the actual text below each iteration.
Anyone have an idea on how I can adjust the preg_match_all
to account for the text below each "header"?
Upvotes: 1
Views: 65
Reputation: 5602
Try this
if(preg_match_all('/\(<b>(?:(?!\(<b>).)*/s', $longstring, $matches)){
print_r($matches);
}
Upvotes: 0
Reputation: 834
Try to use preg_split instead:
$matches = preg_split("/\s*(\(<b>.*?<hr>)\s*/s", trim($longstring), null, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
print_r($matches);
Note: trim is applied on your string to cut leading and trailing spaces.
Result will be something like
Array
(
[0] => (<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
[1] => Lots of text entered here under the first line.<br>And most of it is html, since it is for displaying in a web browser.<br></br></br>
[2] => (<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
[3] => Forgot to put one more thing in the notes.........<br>blah blah blah
[4] => (<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
[5] => Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
)
Upvotes: 1
Reputation: 5693
This is going to be easier if you parse the HTML rather than just trying to regex it, unless you can guarantee the format of the HTML.
You might want to look at Robust and Mature HTML Parser for PHP.
Upvotes: 0