Reputation: 13150

PHP - Advanced Regex Help needed

So I have many large text paragraphs to parse. The end goal is to separate the paragraphs into smaller postings so I can insert them into MySQL.

Here's a very short example of one of the paragraphs in a string:

<?php
$longstring = '

(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
Lots of text entered here under the first line.<br>And most of it is html, since it is for displaying in a web browser.<br></br></br>

(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Forgot to put one more thing in the notes.........<br>blah blah blah
(<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
';

?>

Yep, I have a freaky project of parsing these strings for each entry. Yes, I agree with anyone that this is not a cool task. The original developer allowed new text to be appended to the existing text. That's not a bad idea on some occasions, but for me it is.

I do need help with how to RegEx this beast and place it into a foreach loop so I can start cleaning it up.

Here's how far I got:

<?php

if(preg_match_all('/\(<b>.*?<hr>/', $longstring, $matches)){
print_r($matches);
}
/* output: 
Array 
( 
    [0] => Array 
        ( 
         [0] => (<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
         [1] => (<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr> 
         [2] => (<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr> 
        ) 
) 
*/ 
?>

So, I'm actually doing pretty well with looping through the tops of each entry. I'm kinda proud I figured that out. (Regex is my nemesis.)

So now I'm stuck figuring out how to include the actual text below each iteration.

Anyone have an idea on how I can adjust the preg_match_all to account for the text below each "header"?

Upvotes: 1

Answers (3)

user1236048

Reputation: 5612

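Try this. The (?:(?!\(<b>).)* part consumes one character at a time as long as a new header does not start at that position, and the s modifier lets . match newlines, so each match runs from one header up to the start of the next: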

if(preg_match_all('/\(<b>(?:(?!\(<b>).)*/s', $longstring, $matches)){
  print_r($matches);
}
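
If you also want the author, timestamp and body as separate fields, the same idea can be extended with named capture groups. This is only a sketch: the group names (author, posted, body), the $postings array and the shortened sample string are made up for illustration, and it assumes every header follows exactly the format shown in the question.

```php
<?php
// Shortened copy of the sample string from the question (illustrative data).
$longstring = '
(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
Lots of text entered here.<br>More html.<br>

(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Forgot to put one more thing in the notes.<br>blah blah blah
(<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
';

// "~" is used as the delimiter so the slashes in </b> need no escaping.
// body = every character up to (but not including) the next "(<b>".
$pattern = '~\(<b>(?<author>.*?)</b>\) at <b class="datetimeGMT">(?<posted>.*?)</b><hr>'
         . '(?<body>(?:(?!\(<b>).)*)~s';

$postings = [];
// PREG_SET_ORDER groups the results per match, so each $m is one posting.
if (preg_match_all($pattern, $longstring, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $postings[] = [
            'author' => $m['author'],
            'posted' => $m['posted'],
            'body'   => trim($m['body']), // drop the newlines around the body
        ];
    }
}

print_r($postings);
```

Each element of $postings is then one row you can clean up and insert. Note that a body which itself contains the text (<b> would be cut short at that point.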

Upvotes: 0

Andrei Filonov

Reputation: 834

Try to use preg_split instead:

// -1 means "no limit"; passing null for the limit is deprecated as of PHP 8.1.
$matches = preg_split("/\s*(\(<b>.*?<hr>)\s*/s", trim($longstring), -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

print_r($matches);

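Note: trim is applied to your string to cut leading and trailing whitespace. The header is wrapped in a capturing group so that PREG_SPLIT_DELIM_CAPTURE returns it along with the text between headers.

The result will be something like: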

Array
(
    [0] => (<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
    [1] => Lots of text entered here under the first line.<br>And most of it is html, since it is for displaying in a web browser.<br></br></br>
    [2] => (<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
    [3] => Forgot to put one more thing in the notes.........<br>blah blah blah
    [4] => (<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
    [5] => Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
)
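
Since headers and bodies alternate in that array, you can pair them up in a foreach loop. The following is a rough sketch, not a finished solution: the $postings array, its keys and the shortened sample string are made up for illustration. It deliberately leaves out PREG_SPLIT_NO_EMPTY, on the assumption that a posting might have an empty body, which would otherwise shift the header/body pairing.

```php
<?php
// Shortened copy of the sample string from the question (illustrative data).
$longstring = '
(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
Lots of text entered here.<br>More html.<br>

(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Forgot to put one more thing in the notes.<br>blah blah blah
(<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
';

// Without PREG_SPLIT_NO_EMPTY the result is: text before the first header,
// then header, body, header, body, ...
$parts = preg_split('/\s*(\(<b>.*?<hr>)\s*/s', trim($longstring), -1, PREG_SPLIT_DELIM_CAPTURE);
array_shift($parts); // discard whatever precedes the first header (empty here)

$postings = [];
foreach (array_chunk($parts, 2) as [$header, $body]) {
    // Pull the name and timestamp out of the header line.
    if (preg_match('~\(<b>(.*?)</b>\) at <b[^>]*>(.*?)</b>~s', $header, $h)) {
        $postings[] = [
            'author' => $h[1],
            'posted' => $h[2],
            'body'   => $body,
        ];
    }
}

print_r($postings);
```

From here each $postings entry holds one author, one timestamp and one body, ready for whatever cleanup you need before the insert.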

Upvotes: 1

Mark Leighton Fisher

Reputation: 5703

This is going to be easier if you parse the HTML rather than just trying to regex it, unless you can guarantee the format of the HTML.

You might want to look at Robust and Mature HTML Parser for PHP.

Upvotes: 0
