Regular expression for UTF-8 string slicing at line breaks or after a number of characters

Question

I found a function on the web that uses a regular expression to iterate over a string and insert a line break after a specified number of characters, so that the text fits into a narrow, fixed-width table cell. Here is the function:

/**
 * wordwrap for utf8 encoded strings
 *
 * @param string  $str
 * @param integer $width
 * @param string  $break
 * @param boolean $cut
 * @return string
 * @author Milian Wolff
 */
function utf8_wordwrap($str, $width, $break, $cut = false) {
    if (!$cut || $_SESSION['wordwrap']) {
        // one UTF-8 character (ASCII byte, or lead byte plus continuation bytes), $width times
        $regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){'.$width.'}#';
    } else {
        return $str; // if no wordwrap turned on, returns the original string
    }
    if (function_exists('mb_strlen')) {
        $str_len = mb_strlen($str, 'UTF-8');
    } else {
        // fallback: count ASCII bytes and multibyte lead bytes
        $str_len = preg_match_all('/[\x00-\x7F\xC0-\xFD]/', $str, $var_empty);
    }
    $while_what = ceil($str_len / $width);
    $i = 1;
    $return = '';
    while ($i < $while_what) {
        preg_match($regexp, $str, $matches);
        $string = $matches[0];
        $return .= $string.$break;
        $str = substr($str, strlen($string)); // drop the matched bytes
        $i++;
    }
    return $return.$str;
}

Here is the regexp:

#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){20}#

Combined with the while loop, it does its job well as long as there is no line break character in the string.

An example string:

1. first
2. second
3. third

The output of preg_match:

array (
  0 => '1. first
2. second
3',
)

So it simply counts 20 characters, line breaks included, and returns them.
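The reason the match runs across lines is that the ASCII class `[\x00-\x7F]` also contains the newline byte (`\x0A`), so a line break is counted like any other character. A minimal check with the example string (the variable names are only for illustration):

```php
<?php
// One UTF-8 character: a single ASCII byte, or a lead byte plus continuation bytes.
$regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){20}#';
$str = "1. first\n2. second\n3. third";

preg_match($regexp, $str, $matches);

// The newline byte falls inside [\x00-\x7F], so the 20-character match spans three lines.
var_dump($matches[0] === "1. first\n2. second\n3"); // bool(true)
var_dump(substr_count($matches[0], "\n"));           // int(2)
```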

What I need is for it to return everything up to a newline character (\n) or, if there isn't one, the first 20 characters. So the output in this case would be something like this:

array (
  0 => '1. first',
  1 => '2. second',
  2 => '3. third'
)

UPDATE: I tried Steve Robbins's answer and it worked perfectly until the string contained some special UTF-8 characters. That's my fault: I didn't provide a decent example in the first place. Here is what it does:



And the output is:

array(8) {
  [0]=>
  string(8) "1. first"
  [1]=>
  string(9) "2. second"
  [2]=>
  string(8) "3. third"
  [3]=>
  string(20) "ez eg nyoulőűúú�"
  [4]=>
  string(20) "�3456789öüö987654"
  [5]=>
  string(13) "323456789öü"
  [6]=>
  string(3) "pam"
  [7]=>
  string(5) "papam"
}


http://codepad.org/Gt4CshXt
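The `�` marks in that output are what a multibyte character looks like when it is cut in half. The reported lengths (`string(20)`) are byte counts, and in UTF-8 letters such as `ő`, `ű` and `ú` take two bytes each. The effect can be reproduced with any byte-based cut, while the `u` modifier makes `.` match whole code points. A small sketch with a made-up sample string, assuming the source file is saved as UTF-8:

```php
<?php
$s = "nyoulőűúúú"; // 5 ASCII letters + 5 two-byte letters

var_dump(strlen($s));                         // int(15): bytes
var_dump(preg_match_all('/./u', $s, $chars)); // int(10): code points

// Byte-based cut: 8 bytes ends in the middle of "ű" and leaves invalid UTF-8.
$byteCut = substr($s, 0, 8);
var_dump(preg_match('//u', $byteCut));        // bool(false): subject is not valid UTF-8

// Character-based cut: with the u modifier, 8 means 8 characters.
preg_match('/^.{1,8}/u', $s, $m);
var_dump($m[0]);                              // string(11) "nyoulőűú"
```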

Juni · Accepted Answer

Thanks, everyone, for your efforts! I've found the solution here:

",true));
function utf8_wordwrap($string, $width=20, $break="\n", $cut=false)
{
  if($cut) {
    // Match anything 1 to $width chars long followed by whitespace or EOS,
    // otherwise match anything $width chars long
    $search = '/(.{1,'.$width.'})(?:\s|$)|(.{'.$width.'})/uS';
    $replace = '$1$2'.$break;
  } else {
    // Anchor the beginning of the pattern with a lookahead
    // to avoid crazy backtracking when words are longer than $width
    $search = '/(?=\s)(.{1,'.$width.'})(?:\s|$)/uS';
    $replace = '$1'.$break;
  }
  return preg_replace($search, $replace, $string);
}
?>
string '1. first

2. second

3. third

ez eg
nyoulőűúúú3456789öüö
987654323456789öü

pam

papam
' (length=122)
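If the goal is the array from the question (one entry per line, with over-long lines cut every `$width` characters) rather than a wrapped string, the same `u` modifier can be combined with a split on line breaks. This is only a sketch and not part of the linked solution; `utf8_split_lines` is a hypothetical name, and it assumes valid UTF-8 input and a `$width` of at least 1:

```php
<?php
// Hypothetical helper: split at line breaks, then cut long lines every $width characters.
function utf8_split_lines($str, $width = 20)
{
    $result = array();
    // \R matches any line-break sequence (\n, \r\n, \r); empty lines are dropped.
    foreach (preg_split('/\R/u', $str, -1, PREG_SPLIT_NO_EMPTY) as $line) {
        // With the u modifier, "." matches one code point, not one byte.
        preg_match_all('/.{1,' . (int) $width . '}/u', $line, $chunks);
        foreach ($chunks[0] as $chunk) {
            $result[] = $chunk;
        }
    }
    return $result;
}

$lines = utf8_split_lines("1. first\n2. second\n3. third");
// $lines is array('1. first', '2. second', '3. third')

$long = utf8_split_lines("nyoulőűúúú3456789öüö987654");
// $long is array('nyoulőűúúú3456789öüö', '987654')
```

Unlike the wordwrap function above, this cuts purely by character count and does not try to break at whitespace.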
