Reputation: 267077

String parsing help

I have a string like the following:

$string = "
<paragraph>apples are red...</paragraph>
<paragraph>john is a boy..</paragraph>
<paragraph>this is dummy text......</paragraph>
";

I would like to split this string into an array containing the text found between the <paragraph></paragraph> tags, e.g. something like this:

$string = "
<paragraph>apples are red...</paragraph>
<paragraph>john is a boy..</paragraph>
<paragraph>this is dummy text......</paragraph>
";

$paragraphs = splitParagraphs($string);
/* $paragraphs now contains:
   $paragraphs[0] = apples are red...
   $paragraphs[1] = john is a boy..
   $paragraphs[2] = this is dummy text......
*/

Any ideas?

P.S. It should be case-insensitive: <paragraph>, <PARAGRAPH> and <Paragraph> should all be treated the same way.

Edit: This is not XML. There are a lot of things in the real data that would break XML structure, so I cannot use SimpleXML etc. I need a regular expression which will parse this out.

Upvotes: 4

Answers (7)

Mike Cialowicz

Reputation: 10020

After your edits (case insensitive, and tags too big for XML parser to handle), the following should work:

$paragraphs = array();
$exploded = explode("</", $string);
unset($exploded[count($exploded) - 1]); // remove the useless, final "paragraph>" item
// first item is a special case: drop the leading newline and the opening tag
$exploded[0] = str_ireplace("<paragraph>", "", ltrim($exploded[0]));
foreach ($exploded as $item)
{
    // str_ireplace() matches the tags case-insensitively
    $paragraphs[] = str_ireplace("paragraph>\n<paragraph>", "", $item);
}

Upvotes: 0

intuited

Reputation: 24044

So assuming that you've got some stuff in the paragraphs that is going to break XML format, or you're just looking to learn a bit more about regexp parsing, this should get the job done for the example you've posted. It's not particularly robust, but that's why people like to use XML: it has a formal syntax that makes it easy (or easier, anyway) to parse. In particular, this solution depends on the string starting with a paragraph tag and ending with a paragraph close tag, and on there being nothing but whitespace between each pair of paragraphs. So it's a very literal solution to your example problem. But since your example is the only existing specification document for your custom data format, it was the best I could do :)

$string = " <paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph> <paragraph>this is dummy text......</paragraph> ";
$paragraphs = preg_replace('/(^\s*<paragraph>|<\/paragraph>\s*$)/', '', preg_split('/(?<=<\/paragraph>)\s*(?=<paragraph>)/', $string));

What's going on here is that you're using, in the preg_split function call, zero-width lookaround assertions to find the beginning and end of each paragraph, and then calling preg_replace to crop out the tags from the beginning and end of each chunk. You end up with the contents of $paragraphs being

array (
  0 => 'apples are red...',
  1 => 'john is a boy..',
  2 => 'this is dummy text......',
)

Upvotes: 0

Kobi

Reputation: 138017

If this is a simple structure, with no nesting:

preg_split("#</?paragraph>#i", $string);

To ignore empty tokens:

preg_split("#</?paragraph>#i", $string, -1, PREG_SPLIT_NO_EMPTY);

Source: http://php.net/manual/en/function.preg-split.php
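One thing to watch with the split approach: PREG_SPLIT_NO_EMPTY only drops zero-length tokens, so the newlines between the tags in the question's sample come back as tokens of their own. A minimal sketch of one way to clean that up, assuming input shaped like the question's sample (the trimming step and the variable names are my additions, not part of the original answer):

```php
<?php
// Assumed sample input, modelled on the question (one tag pair upper-cased).
$string = "
<paragraph>apples are red...</paragraph>
<PARAGRAPH>john is a boy..</PARAGRAPH>
<paragraph>this is dummy text......</paragraph>
";

// Split on opening or closing tags, case-insensitively.
$tokens = preg_split("#</?paragraph>#i", $string, -1, PREG_SPLIT_NO_EMPTY);

// The "\n" between tags is not empty, so it survives the flag above.
// Trim every token, then keep only those with a non-zero length.
// 'strlen' is used as the filter so that a paragraph containing just "0" is kept.
$paragraphs = array_values(array_filter(array_map('trim', $tokens), 'strlen'));

print_r($paragraphs);
```

Note that this also trims leading and trailing whitespace inside each paragraph, and it still treats any text outside the tags as a paragraph.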

Upvotes: 2

Mark Byers

Reputation: 838216

If this is actually XML then I agree with the other answers. But if it isn't valid XML, but just something that looks vaguely like XML then you should not try to parse it with an XML parser. Instead you can use a regular expression:

$matches = array();
preg_match_all(":<paragraph>(.*?)</paragraph>:is", $string, $matches);
$result = $matches[1];
print_r($result);

Output:

Array
(
    [0] => apples are red...
    [1] => john is a boy..
    [2] => this is dummy text......
)

Note that the i modifier makes the match case-insensitive, and the s modifier lets . match newlines, so a paragraph can span several lines. All text not inside paragraph tags will be ignored.
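To match the call used in the question, the same pattern can be wrapped in a function. This is only a sketch: splitParagraphs is the name the asker used, the sample input is adapted from the question, and I have swapped the delimiter to # purely for readability.

```php
<?php
// Hypothetical helper, named after the call in the question.
function splitParagraphs($string)
{
    // i: tags match in any case; s: "." also matches newlines.
    // The lazy (.*?) stops at the first closing tag it meets.
    preg_match_all('#<paragraph>(.*?)</paragraph>#is', $string, $matches);

    // $matches[1] holds the text captured by the first group.
    return $matches[1];
}

// Assumed sample input: mixed-case tags and one multi-line paragraph.
$string = "
<paragraph>apples are red...</paragraph>
<PARAGRAPH>john
is a boy..</Paragraph>
<paragraph>this is dummy text......</paragraph>
";

$paragraphs = splitParagraphs($string);
print_r($paragraphs);
```

If nothing matches, the function returns an empty array, so the caller can loop over the result without a special case.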

Upvotes: 5

Mike Cialowicz

Reputation: 10020

Well, you should use an XML parser, like SimpleXML or XMLReader.

However, if you want to hack something up, the following will work:

$string = str_replace("<paragraph>", "", $string);
$string = str_replace("</paragraph>", "", $string);
$paragraphs = explode("\n", $string);

This will work as long as you have one item per line. If you have everything on one line, replace the second line of code above with:

$string = str_replace("</paragraph>", "\n", $string);
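Two caveats with this approach on the question's sample: str_replace() is case-sensitive, and the leading and trailing newlines produce empty array elements. A rough sketch that handles both, assuming one paragraph per line (the variable names here are mine):

```php
<?php
// Assumed sample input: one paragraph per line, mixed-case tags.
$string = "
<paragraph>apples are red...</paragraph>
<PARAGRAPH>john is a boy..</PARAGRAPH>
<Paragraph>this is dummy text......</Paragraph>
";

// str_ireplace() is the case-insensitive counterpart of str_replace();
// passing an array of needles strips both tags in one call.
$stripped = str_ireplace(array("<paragraph>", "</paragraph>"), "", $string);

// Split on newlines, trim each line, and drop the blank lines
// left behind by the leading and trailing newline.
$lines = array_map('trim', explode("\n", $stripped));
$paragraphs = array_values(array_filter($lines, 'strlen'));

print_r($paragraphs);
```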

Good luck!

Upvotes: 0

zneak

Reputation: 138051

This looks an awful lot like XML. If it indeed is, you should use a SimpleXMLElement or any other XML-parsing facility of PHP.

// wrap the fragment in a single root element so it parses as one document
$xml = new SimpleXMLElement('<root>' . $string . '</root>');

foreach ($xml->paragraph as $paragraph)
{
    // do stuff to $paragraph; casting it with (string) gives the contents of the paragraph
}

Upvotes: 0

Brian Agnew

Reputation: 272277

If you're really parsing XML, then the PHP DOM is of use here. You may have a trivial example case above, but if you're parsing XML, I'd use a dedicated XML API.
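For illustration, here is roughly what that could look like with DOMDocument. It is a sketch under two assumptions: the input really is well-formed XML (which the asker says theirs is not), and the tags are lower-case, since XML element names are case-sensitive. The <root> wrapper is my addition, because an XML document needs a single root element.

```php
<?php
// Assumed sample input, copied from the question.
$string = "
<paragraph>apples are red...</paragraph>
<paragraph>john is a boy..</paragraph>
<paragraph>this is dummy text......</paragraph>
";

$doc = new DOMDocument();

// loadXML() returns false (and emits warnings) on malformed input.
if (!$doc->loadXML('<root>' . $string . '</root>')) {
    throw new RuntimeException('Input is not well-formed XML');
}

// Collect the text content of every <paragraph> element.
$paragraphs = array();
foreach ($doc->getElementsByTagName('paragraph') as $node) {
    $paragraphs[] = $node->textContent;
}

print_r($paragraphs);
```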

Upvotes: 0
