Reputation: 21531

Split up words but not if it contains HTML

I need a function or a regex that splits a string on spaces but treats an HTML tag (together with its contents) as a single word.

$str = 'one two <a href="">three</a> four';
$x = explode(" ", $str);
print_r($x);

/* Returns:
  Array
(
    [0] => one
    [1] => two
    [2] => <a
    [3] => href="">three</a>
    [4] => four
)

Looking for way to return:

Array
(
    [0] => one
    [1] => two
    [2] => <a href="">three</a>
    [3] => four
)

*/

Any ideas? Thanks

Upvotes: 1

Answers (4)

Fosco

Reputation: 38526

I wrote and tested this custom function. Give it a shot and let me know what you think.

function fireSplit($str) {
  if (strpos($str,"<") === FALSE) return explode(" ",$str);
  $str = trim($str);
  $out = array();
  $curIdx = 0;
  $endIdx = strlen($str) -1;

  while ($curIdx <= $endIdx) {
        if (substr($str,$curIdx,1) == " ") {
              $curIdx += 1;
              continue;
        }
        $nextspace = strpos($str," ",$curIdx);
        $nexttag = strpos($str,"<",$curIdx);
        $nexttag2 = strpos($str,"/",$nexttag);
        $nexttag3 = strpos($str,">",$nexttag2);

        if ($nextspace === FALSE) {
              $out[] = substr($str,$curIdx);
              $curIdx = $endIdx + 1;
              continue;
        }

        if ($nexttag !== FALSE && $nexttag < $nextspace && $nexttag2 !== FALSE && $nexttag3 !== FALSE) {
              $out[] = substr($str,$curIdx,($nexttag3 - $curIdx + 1));
              $curIdx = $nexttag3 + 1;
        } else {
              $out[] = substr($str,$curIdx,($nextspace - $curIdx));
              $curIdx = $nextspace;
        }
  }
return $out;
}

I called:

fireSplit("one two <a href=\"haha\">three</a> four");
fireSplit("a <b>strong</b> c d e f");

It returned:

array(4) {
  [0]=>
  string(3) "one"
  [1]=>
  string(3) "two"
  [2]=>
  string(24) "<a href="haha">three</a>"
  [3]=>
  string(4) "four"
}

array(6) {
  [0]=>
  string(1) "a"
  [1]=>
  string(13) "<b>strong</b>"
  [2]=>
  string(1) "c"
  [3]=>
  string(1) "d"
  [4]=>
  string(1) "e"
  [5]=>
  string(1) "f"
}

Upvotes: 0

AnimeCYC

Reputation: 36

This is a bit simpler than the above. I haven't fully tested it, but give it a shot.

$str = 'one two <a href="">three</a> four';

if(preg_match_all('%(<[^<]+.*?>|[^\s]+)%', $str, $matches)) {
    array_shift($matches);
    print_r($matches);
}

Here is another version that I tested for about 5 minutes that works a bit better:

$str = 'one two <a href="omfg hi I have spaces"> three</a> four <script type="javascript"> var a = "hello"; </script><random tag>la la la la<nested>hello?</nested></random tag>';

if(preg_match_all('%(<[^<]+.*?>|[^\s]+)%', preg_replace('%([\s]\<|\>[\s])%', '$1', $str), $matches)) {
    array_shift($matches);
    echo '<pre>';
    print_r($matches);
    echo '</pre>';
}

Upvotes: 2

berkes

Reputation: 27553

preg_split('~(<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\2>)| ~i', $text);

That would sometimes work. It splits on either a tag-set, or else a space.

However, what you want is not that simple. You would have to cover all cases of nested tags, tags whose content contains a space (<a href="">Foo Bar Baz</a>), and so on. For that, you are better off using a proper XML (HTML) parser.
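As a rough sketch of the parser route, assuming PHP's DOM extension is installed: wrap the fragment in a <div>, load it with DOMDocument, then walk the top-level nodes, splitting text nodes on whitespace and serializing each element as one item. The function name splitKeepingElements is made up for illustration, and this only looks at top-level nodes; it is not a general solution.

```php
<?php
// Hypothetical helper: split an HTML fragment into words, keeping each
// top-level element (with everything inside it) as a single item.
function splitKeepingElements(string $html): array
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // The wrapping <div> gives the fragment a single root element; the
    // flags stop libxml from adding <html>/<body> and a doctype.
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out = [];
    foreach ($doc->documentElement->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            // Plain text: split on runs of whitespace, dropping empties.
            $words = preg_split('/\s+/', $node->nodeValue, -1, PREG_SPLIT_NO_EMPTY);
            foreach ($words as $word) {
                $out[] = $word;
            }
        } elseif ($node->nodeType === XML_ELEMENT_NODE) {
            // Element: serialize it whole, inner spaces included.
            $out[] = $doc->saveHTML($node);
        }
    }
    return $out;
}

print_r(splitKeepingElements('one two <a href="x">three four</a> five'));
```

Because the element is serialized as a unit, spaces inside the tag or inside its text do not cause a split. Note that the parser may normalize the markup, so the serialized element is not guaranteed to be byte-for-byte identical to the input.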

But it seems to me you have a purpose for that array. Is it to count "words"? If so, a much simpler solution is to strip all HTML from the text with strip_tags(), then apply your word splitter and count the results.
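If counting is indeed the goal, a minimal sketch of that idea could look like the following. The function name countWordsIgnoringHtml is hypothetical, and "word" here just means a run of non-whitespace characters.

```php
<?php
// Hypothetical helper: count the words in a string after removing markup.
function countWordsIgnoringHtml(string $html): int
{
    // strip_tags() removes the tags but keeps the text between them.
    $text = strip_tags($html);
    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return count($words);
}

echo countWordsIgnoringHtml('one two <a href="">three</a> four'); // prints 4
```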

Upvotes: 2

Rob Stevenson-Leggett

Reputation: 35679

You could run a regex replace on the string before and after you call explode().

The tag would then go into explode() looking like this:

<a_href="">test</a>
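A minimal sketch of that mask, explode, unmask idea, assuming a NUL byte never appears in the input so it can serve as the placeholder (the function name explodeKeepingTagSpaces is made up for illustration):

```php
<?php
// Hypothetical helper: hide spaces inside <...> before exploding,
// then put them back in each resulting piece.
function explodeKeepingTagSpaces(string $str): array
{
    $placeholder = "\x00"; // assumption: never occurs in the input

    // Replace spaces only inside anything that looks like a tag.
    $masked = preg_replace_callback(
        '/<[^>]*>/',
        function (array $m) use ($placeholder) {
            return str_replace(' ', $placeholder, $m[0]);
        },
        $str
    );

    $parts = explode(' ', $masked);

    // Restore the original spaces in every piece.
    return array_map(
        function ($part) use ($placeholder) {
            return str_replace($placeholder, ' ', $part);
        },
        $parts
    );
}

print_r(explodeKeepingTagSpaces('one two <a href="">three</a> four'));
// Array ( [0] => one [1] => two [2] => <a href="">three</a> [3] => four )
```

This only protects spaces inside the angle brackets. Text between an opening and a closing tag, such as <a href="">foo bar</a>, would still be split at its space.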

Beyond the simplest cases, though, you're talking about parsing HTML, which is not a good thing to do with regex.

There are plenty of questions on here about parsing HTML. Perhaps you could adapt one of them.

Upvotes: 0
