Reputation: 1161

Using ucwords for non-english characters

Currently, I'm using a ucwords-related function to make capital letters after hyphens, dots and apostrophes:

function ucwordsMore ($str){
    $str = ucwords($str);
    $str = str_replace('- ','-',ucwords(str_replace('-','- ',$str)));  // hyphens
    $str = str_replace('. ','.',ucwords(str_replace('.','. ',$str)));  // dots
    $str = preg_replace("/\w[\w']*/e", "ucwords('\\0')", $str);        // apostrophes

    return $str;
}

It works fine for English letters. However, non-English letters are not handled properly. For instance, this text:

La dernière usine française d'accordéons reste à Tulle

is turned into this text:

La DernièRe Usine FrançAise D'accordéOns Reste à Tulle

But I need it to be:

La Dernière Usine Française D'Accordéons Reste À Tulle

Any ideas?

Upvotes: 2

Answers (4)

Edson Medina

Reputation: 10279

Use this:

function mb_ucwords ($string)
{
    return mb_convert_case ($string, MB_CASE_TITLE, 'UTF-8'); 
}

Upvotes: 1

user557597

Reputation:

As @Jon mentioned, you need to set a locale, which defines the upper/lower case relationships that locale-aware function calls rely on. The relevant category is typically LC_CTYPE.

There are also categories for numeric behavior, sorting, monetary formatting and others. The locale needs to be installed on your machine, or be available via plugins or modules, etc. Read up on that.

I don't know PHP's locale handling at all, so here is a sample in Perl that uses a different regex approach than yours. I couldn't quite follow your solution, but hopefully you can get some ideas from mine.

use locale;
use POSIX qw(locale_h);

setlocale(LC_CTYPE, "en_US");

$str = "La dernière usine française d'accordéons reste à Tulle";

$str =~ s/ (?:^|(?<=\s)|(?<=\w-)|(?<=\w\.)|(?<=\w\')) (\w) / uc($1) /xeg;

print "$str\n";

Output

La Dernière Usine Française D'Accordéons Reste À Tulle

Regex

The form is s///, i.e. find and replace:

s/                  # Search

  (?:                  # Group
      ^                   # beginning of string
    | (?<=\s)             # or, lookbehind \s
    | (?<=\w-)            # or, lookbehind \w-
    | (?<=\w\.)           # or, lookbehind \w\.
    | (?<=\w\')           # or, lookbehind \w\'
  )                    # End group
  (\w)                 # Capture group 1, a single word char

/                   # Replace
  uc($1)               # Uppercased word char from capt grp 1

/xeg;               # Modifiers x(expanded), e(eval), g(global)
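
For comparison, the same substitution could be sketched in PHP roughly as follows. This is only a rough port under some assumptions: the input is valid UTF-8, the mbstring extension is loaded, and the function name ucwordsUnicode is made up for illustration. It uses \p{L} with the /u modifier instead of relying on a locale for \w.

```php
<?php
// Hypothetical helper (not part of PHP): uppercases a letter at the start of
// the string, after whitespace, or after a letter followed by - . or '
// Assumes $str is valid UTF-8 and ext-mbstring is available.
function ucwordsUnicode(string $str): string
{
    return preg_replace_callback(
        '/(?:^|(?<=\s)|(?<=\p{L}[-.\']))\p{L}/u',
        function (array $m): string {
            // $m[0] is the single letter that was matched
            return mb_strtoupper($m[0], 'UTF-8');
        },
        $str
    );
}

echo ucwordsUnicode("La dernière usine française d'accordéons reste à Tulle"), "\n";
// La Dernière Usine Française D'Accordéons Reste À Tulle
```

Like ucwords, this leaves the remaining letters of each word untouched rather than lowercasing them.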

Upvotes: 2

matino

Reputation: 17725

Have a look at Kohana UTF8 class - http://kohanaframework.org/3.2/guide/api/UTF8

Upvotes: 0

Jon

Reputation: 437704

You probably need to use setlocale for LC_CTYPE before such conversions will be done correctly, but there is also the issue of what encoding your string is in. ucwords is only meant to work on single-byte-encoded text.
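
To illustrate the encoding point, here is a small sketch assuming the text is UTF-8 and the mbstring extension is loaded (the "\u{E8}" escape needs PHP 7 or later). It shows that è occupies two bytes, which is why byte-oriented functions can split a word in the middle of a character.

```php
<?php
// "\u{E8}" is è encoded as UTF-8; it takes two bytes.
$e = "\u{E8}";

var_dump(strlen($e));              // int(2): strlen counts bytes
var_dump(mb_strlen($e, 'UTF-8'));  // int(1): mb_strlen counts characters

// With the /u modifier PCRE reads the subject as UTF-8,
// so \p{L} matches è as a single letter.
var_dump(preg_match('/^\p{L}+$/u', "derni{$e}re"));  // int(1)
```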

Upvotes: 1
