Patrik Johansson

Reputation: 155

How to split a long string with PHP?

I am currently looking into splitting a very long string that could contain HTML tags and character entities.

One example is:

Thiiiissssaaaveryyyylonnngggstringgg

For this I have used this function in the past:

function split($sString, $iCount = 75)
{
    $text = $sString;
    $new_text = '';
    $text_1 = explode('>', $text);
    $sizeof = sizeof($text_1);
    for ($i = 0; $i < $sizeof; ++$i) {
        $text_2 = explode('<', $text_1[$i]);
        if (!empty($text_2[0])) {
            $new_text .= preg_replace('#([^\n\r .]{' . $iCount . '})#iu', '\\1  ', $text_2[0]);
        }
        if (!empty($text_2[1])) {
            $new_text .= '<' . $text_2[1] . '>';
        }
    }
    return $new_text;
}

The function finds such long runs of characters and splits them after X characters. The problem arises when HTML tags or character entities are mixed in, like this:

Thissssiisss<a href="#">lonnnggg</a>sting&#228;&#228;&#228;

I have been trying to figure out how to split the string above without counting the characters inside HTML tags, while counting each character entity (such as &#228;) as one character.

Any help would be great.

Thank you

Upvotes: 4

Views: 1382

Answers (4)

Dominic Rodger

Reputation: 99751

If you're worried about UTF-8 support for wordwrap, then you want this:

// wordwrap() with UTF-8 support
function utf8_wordwrap($str, $width = 75, $break = "\n")
{
    $str = preg_split('#[\s\n\r]+#', $str);
    $len = 0;
    $return = ''; // initialise the result so the first .= does not hit an undefined variable
    foreach ($str as $val) {
        $val .= ' ';
        $tmp = mb_strlen($val, 'utf-8');
        $len += $tmp;
        if ($len >= $width) {
            $return .= $break . $val;
            $len = $tmp;
        }
        else {
            $return .= $val;
        }
    }
    return $return;
}

Source: PHP Manual Comment

As to your issue with the character entities: you might want to look at html_entity_decode, which converts numeric character references (e.g. &#223;) to the characters they represent. Numeric references always denote Unicode code points, so pass it a charset (for example 'UTF-8') to tell it which encoding the decoded characters should be produced in.
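To illustrate the html_entity_decode idea, here is a minimal sketch that counts the visible characters of the example string from the question. The helper name visible_length is made up for this example, and it assumes the input is valid UTF-8 and that strip_tags is good enough for the markup involved.

```php
<?php
// Hypothetical helper (not from the PHP manual): count the characters a
// reader would see, ignoring tags and treating each entity as one character.
function visible_length(string $html): int
{
    // Drop the tags, keeping only the text between them.
    $text = strip_tags($html);
    // Turn entities such as &#228; into the real characters, encoded as UTF-8.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Count code points rather than bytes.
    return mb_strlen($text, 'UTF-8');
}

echo visible_length('Thissssiisss<a href="#">lonnnggg</a>sting&#228;&#228;&#228;'), "\n"; // 28
```

This only measures the length; it does not tell you where to put the breaks in the original markup, since the positions in the decoded text no longer line up with the HTML.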

Upvotes: 2

karim79

Reputation: 342625

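Get rid of that complexity: use a DOM parser to extract the plain text.

// Dump contents (without tags) from HTML.
// file_get_html() is provided by the third-party PHP Simple HTML DOM Parser library.
$pageText = file_get_html('http://www.google.com/')->plaintext;
// Note: strlen() counts bytes, not characters.
echo "Length is: " . strlen($pageText);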

Upvotes: 0

Omry Yadan

Reputation: 33596

I use this function to split strings in FireStats.

You can probably take it out of context and use it pretty easily. Note that it calls some other functions. You can skip the UTF-8 check if you like.
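For reference, and to be clear this is not the FireStats function, here is a rough sketch of one way such a split could work for the string in the question. It tokenises the input into tags, entities and single characters, copies tags through without counting them, and counts each entity as one character. The name split_long_runs and its parameters are invented for this example, and it assumes valid UTF-8 input and reasonably well-formed tags.

```php
<?php
// Hypothetical sketch: insert $break once an unbroken run of visible
// characters exceeds $limit. Tags are skipped; an entity counts as one.
function split_long_runs(string $html, int $limit = 75, string $break = ' '): string
{
    // Tokens, tried in order: a tag, a character entity, or any single character.
    $pattern = '/<[^>]*>|&(?:#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);|./isu';
    if (!preg_match_all($pattern, $html, $m)) {
        return $html; // empty or invalid UTF-8 input: leave it untouched
    }

    $out = '';
    $run = 0; // visible characters since the last whitespace or inserted break
    foreach ($m[0] as $token) {
        if ($token[0] === '<' && strlen($token) > 1) {
            $out .= $token; // a tag: copy it through, never count it
            continue;
        }
        if (preg_match('/^\s$/u', $token)) {
            $run = 0; // whitespace is a natural break point
        } elseif (++$run > $limit) {
            $out .= $break; // run too long: break before this character
            $run = 1;
        }
        $out .= $token;
    }
    return $out;
}

echo split_long_runs('Thissssiisss<a href="#">lonnnggg</a>sting&#228;&#228;&#228;', 10), "\n";
// Thissssiis ss<a href="#">lonnnggg</a> sting&#228;&#228;&#228;
```

Unlike the function in the question, this places the break before the character that would exceed the limit, so a run of exactly $limit characters is left alone.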

Upvotes: 0

Amber

Reputation: 526533

Consider using the built-in wordwrap() instead?
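A quick sketch of what that looks like on the first example from the question, with the fourth argument set to true so that words longer than the width are cut. The width of 10 is just chosen for the demonstration.

```php
<?php
$long = 'Thiiiissssaaaveryyyylonnngggstringgg';

// Break every 10 characters; true forces over-long words to be cut.
$wrapped = wordwrap($long, 10, "\n", true);

echo $wrapped, "\n";
// Thiiiissss
// aaaveryyyy
// lonnngggst
// ringgg
```

Keep in mind that wordwrap() counts bytes and knows nothing about HTML, so on its own it can cut through a tag or an entity such as &#228; in the second example.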

Upvotes: 2
