Reputation: 155
I am currently looking into splitting a very long string that could contain HTML markup.
One example is:
Thiiiissssaaaveryyyylonnngggstringgg
For this I have used this function in the past:
function split($sString, $iCount = 75)
{
    $text = $sString;
    $new_text = '';
    $text_1 = explode('>', $text);
    $sizeof = sizeof($text_1);
    for ($i = 0; $i < $sizeof; ++$i) {
        $text_2 = explode('<', $text_1[$i]);
        if (!empty($text_2[0])) {
            $new_text .= preg_replace('#([^\n\r .]{'. $iCount .'})#iu', '\\1 ', $text_2[0]);
        }
        if (!empty($text_2[1])) {
            $new_text .= '<' . $text_2[1] . '>';
        }
    }
    return $new_text;
}
The function finds long runs of characters and inserts a space after every X characters. The problem arises when HTML tags or special characters are mixed in, like this:
Thissssiisss<a href="#">lonnnggg</a>stingäää
I have been trying to figure out how to split the string above without counting the characters inside HTML tags, while counting each special character (such as ä) as 1.
Any help would be great.
Thank you
Upvotes: 4
Views: 1382
Reputation: 99751
If you're worried about UTF-8 support for wordwrap, then you want this:
// wordwrap() with UTF-8 support
function utf8_wordwrap($str, $width = 75, $break = "\n")
{
    $words = preg_split('#[\s\n\r]+#', $str);
    $return = ''; // initialise the result to avoid an undefined-variable notice
    $len = 0;
    foreach ($words as $val) {
        $val .= ' ';
        $tmp = mb_strlen($val, 'utf-8'); // length in characters, not bytes
        $len += $tmp;
        if ($len >= $width) {
            $return .= $break . $val;
            $len = $tmp;
        }
        else {
            $return .= $val;
        }
    }
    return $return;
}
Source: PHP Manual Comment
As to your issue with codepoints: you might want to look at html_entity_decode, which I think converts numeric entities (e.g. &#223;) to the character they represent. You'll need to give it a charset (e.g. 'UTF-8') so it knows which encoding to use for the decoded character (223 is the Unicode code point for ß; how it is stored as bytes depends on the charset).
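As a minimal sketch of that idea, the snippet below decodes numeric entities and then compares the byte length with the character length. The sample string and variable names are made up for illustration, and it assumes the mbstring extension is available and that UTF-8 is the desired output encoding.

```php
<?php
// Sample input with numeric entities (illustrative, not taken from real data).
$encoded = 'sting&#228;&#228;&#228;';

// Decode entities into real characters, encoded as UTF-8.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// strlen() counts bytes; mb_strlen() counts characters.
$bytes = strlen($decoded);
$chars = mb_strlen($decoded, 'UTF-8');

echo $decoded, ': ', $bytes, ' bytes, ', $chars, " characters\n";
```

Decoding first means each ä counts as one character when the length is measured with mb_strlen, rather than as six characters of entity text.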
Upvotes: 2
Reputation: 342625
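Get rid of that complexity: use a DOM parser to extract the plain text.
// Dump contents (without tags) from HTML.
// file_get_html() comes from the PHP Simple HTML DOM Parser library.
$pageText = file_get_html('http://www.google.com/')->plaintext;
// mb_strlen() counts characters; strlen() would count bytes.
echo "Length is: " . mb_strlen($pageText, 'UTF-8');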
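If the tags have to stay in the output rather than be stripped, one possible alternative is a small tag-aware splitter. The sketch below is only an illustration: split_outside_tags is a hypothetical helper name, it assumes PHP 7+, valid UTF-8 input, entities already decoded, and markup simple enough that a tag is anything between < and >. It keeps a running count of visible characters across tags and inserts a space once the limit is reached.

```php
<?php
// Hypothetical helper: break runs of more than $limit visible characters
// with a space, without counting or altering anything inside HTML tags.
function split_outside_tags(string $html, int $limit = 75): string
{
    // With DELIM_CAPTURE, even indexes hold text and odd indexes hold tags.
    $parts = preg_split('#(<[^>]*>)#u', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    $run = 0; // length of the current unbroken run of visible characters

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $out .= $part; // a tag: copy it through untouched
            continue;
        }
        // Walk the text one UTF-8 character at a time.
        foreach (preg_split('//u', $part, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            if (preg_match('/^\s$/u', $char)) {
                $run = 0; // existing whitespace ends the run
            } else {
                if ($run === $limit) {
                    $out .= ' '; // run is full: break before this character
                    $run = 0;
                }
                $run++;
            }
            $out .= $char;
        }
    }
    return $out;
}

echo split_outside_tags('Thissssiisss<a href="#">lonnnggg</a>stingäää', 10), "\n";
```

Because the counter is not reset at tag boundaries, a long word that is partly wrapped in a link is still treated as one run, and each multibyte character such as ä counts as 1.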
Upvotes: 0
Reputation: 33596
I use this function to split strings in FireStats.
You can probably take it out of context and use it pretty easily. Note that it calls some other functions; you can skip the UTF-8 check if you like.
Upvotes: 0