Reputation: 21531
I need a function or a regex that splits a string on spaces but treats an HTML element (opening tag, content, and closing tag) as a single word.
$str = 'one two <a href="">three</a> four';
$x = explode(" ", $str);
print_r($x);
/* Returns:
Array
(
[0] => one
[1] => two
[2] => <a
[3] => href="">three</a>
[4] => four
)
Looking for a way to return:
Array
(
[0] => one
[1] => two
[2] => <a href="">three</a>
[3] => four
)
*/
Any ideas? Thanks
Upvotes: 1
Views: 216
Reputation: 38526
I wrote and tested this custom function. Give it a shot and let me know what you think.
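function fireSplit($str) {
    // No tags at all: a plain split on spaces is enough.
    if (strpos($str, "<") === FALSE) return explode(" ", $str);
    $str = trim($str);
    $out = array();
    $curIdx = 0;
    $endIdx = strlen($str) - 1;
    while ($curIdx <= $endIdx) {
        // Skip the separator spaces between tokens.
        if (substr($str, $curIdx, 1) == " ") {
            $curIdx += 1;
            continue;
        }
        $nextspace = strpos($str, " ", $curIdx);
        $nexttag = strpos($str, "<", $curIdx);
        // Search for the closing tag ("</") rather than any "/", so a slash
        // inside an attribute value (e.g. a URL) is not mistaken for it.
        $nexttag2 = ($nexttag === FALSE) ? FALSE : strpos($str, "</", $nexttag);
        $nexttag3 = ($nexttag2 === FALSE) ? FALSE : strpos($str, ">", $nexttag2);
        // No more spaces: the rest of the string is the last token.
        if ($nextspace === FALSE) {
            $out[] = substr($str, $curIdx);
            $curIdx = $endIdx + 1;
            continue;
        }
        if ($nexttag !== FALSE && $nexttag < $nextspace && $nexttag2 !== FALSE && $nexttag3 !== FALSE) {
            // A tag starts before the next space: take everything up to the ">" of its closing tag.
            $out[] = substr($str, $curIdx, ($nexttag3 - $curIdx + 1));
            $curIdx = $nexttag3 + 1;
        } else {
            // Plain word: take everything up to the next space.
            $out[] = substr($str, $curIdx, ($nextspace - $curIdx));
            $curIdx = $nextspace;
        }
    }
    return $out;
}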
I called:
fireSplit("one two <a href=\"haha\">three</a> four");
fireSplit("a <b>strong</b> c d e f");
It returned:
array(4) {
[0]=>
string(3) "one"
[1]=>
string(3) "two"
[2]=>
string(24) "<a href="haha">three</a>"
[3]=>
string(4) "four"
}
array(6) {
[0]=>
string(1) "a"
[1]=>
string(13) "<b>strong</b>"
[2]=>
string(1) "c"
[3]=>
string(1) "d"
[4]=>
string(1) "e"
[5]=>
string(1) "f"
}
Upvotes: 0
Reputation: 36
This is a bit simpler than the above. I haven't fully tested it, but give it a shot.
$str = 'one two <a href="">three</a> four';
if(preg_match_all('%(<[^<]+.*?>|[^\s]+)%', $str, $matches)) {
array_shift($matches);
print_r($matches);
}
Here is another version that I tested for about 5 minutes that works a bit better:
$str = 'one two <a href="omfg hi I have spaces"> three</a> four <script type="javascript"> var a = "hello"; </script><random tag>la la la la<nested>hello?</nested></random tag>';
if(preg_match_all('%(<[^<]+.*?>|[^\s]+)%', preg_replace('%([\s]\<|\>[\s])%', '$1', $str), $matches)) {
array_shift($matches);
echo '<pre>';
print_r($matches);
echo '</pre>';
}
Upvotes: 2
Reputation: 27553
preg_split('#(<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\2>)| #i', $text);
That would sometimes work. It splits on either a tag-set, or else a space.
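To keep the tag-set in the result rather than discard it as a delimiter, one option is to pass PREG_SPLIT_DELIM_CAPTURE. The sketch below is a simplified variant, not the exact pattern above: the function name splitOnSpacesOrTags is hypothetical, the pattern has a single capture group and no backreference, and it assumes tags are not nested.

```php
<?php
// Hypothetical helper: split on whitespace, but return each tag-set as its own piece.
function splitOnSpacesOrTags(string $text): array {
    // Group 1 matches "<tag ...>content</tag>" (first closing tag wins, so no nesting).
    // PREG_SPLIT_DELIM_CAPTURE returns what group 1 matched;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces around delimiters.
    return preg_split(
        '#(<[^>]+>.*?</[^>]+>)|\s+#',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

print_r(splitOnSpacesOrTags('one two <a href="">three</a> four'));
```

Like the pattern above, this stops at the first closing tag it finds, so nested elements would be cut short.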
However, what you want is not that simple. You would have to cover nested tags, tags whose content contains a space (<a href="">Foo Bar Baz</a>), and so on. For that, you are better off using a proper XML/HTML parser.
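As a rough illustration of the parser route, here is a minimal sketch using PHP's DOM extension. The function name splitKeepingTags and the wrapping <div> are assumptions made for this example. It only looks at top-level nodes: text is split on whitespace and each element is kept whole, including nested tags and spaces inside it.

```php
<?php
// Hypothetical helper: top-level elements stay whole, top-level text is split into words.
function splitKeepingTags(string $html): array {
    $doc = new DOMDocument();
    // Collect parser warnings (e.g. for unknown tags) instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a <div> so it has a single root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out = [];
    foreach ($doc->documentElement->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            // Plain text between elements: split into words.
            foreach (preg_split('/\s+/', $node->nodeValue, -1, PREG_SPLIT_NO_EMPTY) as $word) {
                $out[] = $word;
            }
        } elseif ($node->nodeType === XML_ELEMENT_NODE) {
            // An element: serialize it back to markup as one "word".
            $out[] = $doc->saveHTML($node);
        }
    }
    return $out;
}

print_r(splitKeepingTags('one two <a href="haha">three</a> four'));
```

The serialized markup comes from the parser, so it may be normalized compared with the input (attribute quoting, entities), and non-ASCII input may need explicit encoding handling.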
But it seems to me you have a specific purpose for that array. Is it to count "words"? If so, a much simpler solution is to strip all HTML from the text with strip_tags() and then apply your word splitter and count the results.
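If counting is indeed the goal, a minimal sketch could look like this. The function name countWords is hypothetical, and splitting on runs of whitespace is just one possible "wordsplitter".

```php
<?php
// Hypothetical helper: count whitespace-separated words, ignoring markup.
function countWords(string $html): int {
    // Remove all tags, keeping only the text content.
    $text = trim(strip_tags($html));
    if ($text === '') {
        return 0;
    }
    // Split on runs of whitespace and count the pieces.
    return count(preg_split('/\s+/', $text));
}

echo countWords('one two <a href="">three</a> four'), "\n"; // prints 4
```

Note that adjacent elements with no space between them (for example <b>a</b><i>b</i>) collapse into one word after stripping.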
Upvotes: 2
Reputation: 35679
You could do a regex replace on the string before and after you use explode,
so that it goes into explode like this:
<a_href="">test</a>
Beyond simple cases, though, you're talking about parsing HTML, which is not a good job for regex.
There are plenty of questions on here about parsing HTML. Perhaps you could adapt one of them.
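A minimal sketch of the replace, explode, restore idea might look like this. The function name explodeProtectingTags and the NUL byte placeholder are assumptions for the example. It only protects spaces inside a tag itself (between "<" and ">"), not spaces in the text between an opening and a closing tag.

```php
<?php
// Hypothetical helper: hide spaces inside tags, explode, then restore them.
function explodeProtectingTags(string $str, string $placeholder = "\x00"): array {
    // Step 1: inside every "<...>" replace spaces with the placeholder.
    $masked = preg_replace_callback('/<[^>]*>/', function ($m) use ($placeholder) {
        return str_replace(' ', $placeholder, $m[0]);
    }, $str);
    // Step 2: a plain explode now leaves the tags untouched.
    $parts = explode(' ', $masked);
    // Step 3: turn the placeholder back into spaces in each piece.
    return array_map(function ($part) use ($placeholder) {
        return str_replace($placeholder, ' ', $part);
    }, $parts);
}

print_r(explodeProtectingTags('one two <a href="">three</a> four'));
```

The placeholder must be a character that cannot appear in the input; the underscore shown above would clash with real underscores in the text.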
Upvotes: 0