Ali

Reputation: 267077

Parsing a mix of structured and unstructured text

I need to parse blocks of text in a format like this:

Today the weather is excellent bla bla bla.
<temperature>35</temperature>. 
I'm in a great mood today. 
<item>Desk</item>

I want to parse text like this and turn it into an array like this:

$array[0]['text'] = 'Today the weather is excellent bla bla bla. ';
$array[0]['type'] = 'normalText';

$array[1]['text'] = '35';
$array[1]['type'] = 'temperature';

$array[2]['text'] = ". I'm in a great mood today.";
$array[2]['type'] = 'normalText';

$array[3]['text'] = 'Desk';
$array[3]['type'] = 'item';

Essentially, I want the array to contain all of the text in the same order as in the original, but split into types: normal text (anything that wasn't between tags) and other types such as temperature or item, determined by the tags the text was between.

Is there a way to do this with regular expressions (i.e. separate the text into normal text and other types), or should I convert the text behind the scenes into properly structured text, like:

<normal>Today the weather is excellent bla bla bla.</normal>
<temperature>35</temperature>.
<normal> I'm in a great mood today.</normal><item>Desk</item>

before trying to parse it?

Upvotes: 1

Views: 437

Answers (2)

Carlos

Reputation: 5072

EDIT: Now it works exactly as expected!

Solution:

<?php

$code = <<<'CODE'
Today the weather is excellent bla bla bla.
<temperature>35</temperature>. 
I'm in a great mood today. 
<item>Desk</item>
CODE;

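// Split on complete <tag>...</tag> elements. PREG_SPLIT_DELIM_CAPTURE keeps
// each element in the result, in between the plain-text pieces.
$result = array_filter(
    array_map(
        function ($element) {
            // Compare against '' rather than using empty(), so a piece
            // consisting of "0" is not discarded.
            if ($element !== '') {
                // Tagged piece: group 1 is the tag name, group 2 its content.
                if (preg_match('/^<([^>]+)>([^<]+)</', $element, $matches)) {
                    return array('text' => $matches[2],
                                 'type' => $matches[1]);
                } else {
                    // Anything else is plain text between tags.
                    return array('text' => $element,
                                 'type' => 'normal');
                }
            }
            // Empty pieces become false so that array_filter() drops them.
            return false;
        },
        // A limit of -1 means "no limit"; passing null is deprecated since PHP 8.1.
        preg_split('/(<[^>]+>[^<]+<\/[^>]+>)/', $code, -1, PREG_SPLIT_DELIM_CAPTURE)
    )
);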

print_r($result);

Output:

Array
(
    [0] => Array
        (
            [text] => Today the weather is excellent bla bla bla.

            [type] => normal
        )

    [1] => Array
        (
            [text] => 35
            [type] => temperature
        )

    [2] => Array
        (
            [text] => . 
I'm in a great mood today. 

            [type] => normal
        )

    [3] => Array
        (
            [text] => Desk
            [type] => item
        )

)
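The split pattern above accepts any closing tag, so `<temperature>35</item>` would also count as one element. If that matters, a variation along these lines might help. It is only a sketch: the function name `splitTaggedText` is made up for illustration, and it assumes tag names consist of word characters and that tags are not nested. The backreference `\1` forces the closing tag to repeat the opening tag's name, and because the pattern has two capture groups, `preg_split()` returns the pieces in a repeating text, tag name, tag content order.

```php
<?php

// Hypothetical helper, not part of the answer above.
// Assumes word-character tag names and no nested tags.
function splitTaggedText(string $input): array
{
    // \1 requires the closing tag to match the opening tag's name.
    // With two capture groups, the pieces come back as:
    // text, tag name, tag content, text, tag name, tag content, ..., text
    $parts = preg_split('/<(\w+)>(.*?)<\/\1>/s', $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $result = array();
    $count  = count($parts);

    for ($i = 0; $i < $count; $i += 3) {
        // Plain text before the next element (skipped when empty).
        if ($parts[$i] !== '') {
            $result[] = array('text' => $parts[$i], 'type' => 'normalText');
        }
        // The tagged element following that text, if there is one.
        if ($i + 2 < $count) {
            $result[] = array('text' => $parts[$i + 2], 'type' => $parts[$i + 1]);
        }
    }

    return $result;
}
```

Calling `splitTaggedText($code)` with the `$code` string from above should give the same four entries, with consecutive keys and `normalText` as the type for untagged text, as in the question.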

Upvotes: 3

cerealy

Reputation: 122

Try reading through the text line by line. There are two cases: adding normal text, and adding text that belongs to a special tag. While appending normal text to a variable, look for an opening tag with a regular expression:

preg_match('/<(\w+)>/', $line_from_text, $matches)

This matches the opening tag, and the parentheses capture the tag name into $matches[1], which you can use as the type in your array. Then keep adding text to a variable until you meet the matching end tag. Hope this helps.
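Put together, the line-by-line idea could look roughly like the sketch below. The function name `scanTaggedText` and the `normalText` label are assumptions for illustration, and it assumes tags are not nested. A buffer collects text, and whenever an opening or closing tag is found the buffer is flushed into the result with the appropriate type.

```php
<?php

// Hypothetical helper illustrating the line-by-line approach.
// Assumes word-character tag names and no nested tags.
function scanTaggedText(string $input): array
{
    $result  = array();
    $buffer  = '';    // text collected for the current piece
    $openTag = null;  // name of the tag we are inside, null for normal text

    // Split into lines while keeping the line endings.
    preg_match_all('/.*\n|.+/', $input, $lines);

    foreach ($lines[0] as $line) {
        while ($line !== '') {
            // Outside a tag, look for any opening tag;
            // inside one, look for its own closing tag.
            $pattern = $openTag === null
                ? '/<(\w+)>/'
                : '/<\/' . preg_quote($openTag, '/') . '>/';

            if (!preg_match($pattern, $line, $m, PREG_OFFSET_CAPTURE)) {
                // No tag on the rest of this line: keep collecting.
                $buffer .= $line;
                break;
            }

            // Collect the text before the tag, then flush the buffer.
            $buffer .= substr($line, 0, $m[0][1]);
            if ($buffer !== '') {
                $result[] = array('text' => $buffer, 'type' => $openTag ?? 'normalText');
            }
            $buffer  = '';
            $openTag = $openTag === null ? $m[1][0] : null;

            // Continue scanning after the tag.
            $line = substr($line, $m[0][1] + strlen($m[0][0]));
        }
    }

    // Flush whatever is left (an unclosed tag keeps its tag name as type).
    if ($buffer !== '') {
        $result[] = array('text' => $buffer, 'type' => $openTag ?? 'normalText');
    }

    return $result;
}
```

Because the buffer carries over from one line to the next, tagged content that spans several lines ends up in a single entry. Elements with empty content are dropped in this sketch.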

Upvotes: 1
