Phelic

Reputation: 61

Strange umlaut encoding on file system

From time to time I encounter files that have a strange (wrong?) encoding of umlaut characters in their file names. Maybe the encoding comes from a Mac system, but I'm not sure. I work with Windows.

For example: Volkszählung instead of Volkszählung (try to use Backspace after the first ä).

When I paste it into an ANSI-encoded file in Notepad++, it inserts Volksza¨hlung.

I have two questions:
a) Where does that come from and which encoding is it?
b) glob() in PHP does not list these files when I use the wildcard character *. How can I detect them in PHP?

Upvotes: 3

Views: 630

Answers (1)

Wyzard

Reputation: 34573

That's a combining character: specifically, U+0308 COMBINING DIAERESIS. Combining characters are what let you put things like umlauts on any character, not just specific "precomposed" characters with built-in umlauts like U+00E4 LATIN SMALL LETTER A WITH DIAERESIS. Although it's not necessary to use a combining character in this case (since a suitable precomposed character exists), it's not wrong either.
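To make the two spellings concrete, here is a minimal sketch that builds both forms and converts the decomposed one to the precomposed one. It assumes PHP 7 or later (for the `\u{...}` escape) and the intl extension, which provides the `Normalizer` class; neither is mentioned in the question, so treat this as one possible approach rather than the required one.

```php
<?php
// Assumption: the intl extension is installed and enabled.
$decomposed  = "Volksza\u{0308}hlung"; // "a" followed by U+0308 COMBINING DIAERESIS
$precomposed = "Volksz\u{00E4}hlung";  // single code point U+00E4

// The strings look identical on screen but are different code point sequences.
var_dump($decomposed === $precomposed); // bool(false)

// NFC composes a base letter plus combining mark into the precomposed
// character where one exists.
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($nfc === $precomposed); // bool(true)
```

Normalizing both sides to the same form before comparing file names is one way to make the two spellings compare as equal.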

(Note, this isn't an "encoding" at all: in the context of Unicode, an encoding is a method for transforming Unicode codepoint numbers into byte sequences so they can be stored in a file. UTF-8 and UTF-16 are encodings. But combining characters are Unicode codepoints, just like normal characters; they're not something produced by the encoding process.)
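The distinction between code points and encodings can be seen by dumping the bytes. The sketch below assumes PHP 7 or later and the mbstring extension (used here only to convert to UTF-16); the variable names are illustrative.

```php
<?php
// Assumption: the mbstring extension is enabled (needed for mb_convert_encoding).
$decomposed  = "a\u{0308}"; // two code points: U+0061, U+0308
$precomposed = "\u{00E4}";  // one code point:  U+00E4

// The "\u{...}" escape always produces UTF-8 bytes.
echo bin2hex($decomposed), "\n";  // 61cc88
echo bin2hex($precomposed), "\n"; // c3a4

// Same code points, different encoding (UTF-16, big-endian): different bytes.
echo bin2hex(mb_convert_encoding($decomposed, 'UTF-16BE', 'UTF-8')), "\n";  // 00610308
echo bin2hex(mb_convert_encoding($precomposed, 'UTF-16BE', 'UTF-8')), "\n"; // 00e4
```

In both encodings the combining character is present as its own code point; changing the encoding only changes how each code point is written out as bytes.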

If you're working with Unicode text, you should be using PHP's mbstring functions. The built-in string functions aren't Unicode-aware, and see strings only as sequences of bytes rather than sequences of characters. I'm not sure how mbstring treats combining characters, though; the documentation doesn't mention them at all, as far as I can see.
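As a quick check of how the byte-oriented functions and mbstring differ on this input, the following sketch compares `strlen()` and `mb_strlen()`. It assumes the mbstring extension is enabled and that the string is UTF-8. In this example mbstring counts the combining character as a separate character, so it can also split a letter from its diaeresis.

```php
<?php
// Assumption: the mbstring extension is enabled.
$name = "Volksza\u{0308}hlung"; // decomposed form: "a" + U+0308

echo strlen($name), "\n";             // 14 (bytes)
echo mb_strlen($name, 'UTF-8'), "\n"; // 13 (code points)

// Cutting after 7 code points separates the "a" from its combining mark.
$cut = mb_substr($name, 0, 7, 'UTF-8');
var_dump($cut === "Volksza"); // bool(true): the diaeresis is lost
```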

You should also take a look at the grapheme functions, which are specifically meant to cope with combining characters. A "grapheme cluster" is the single visual character produced by a base character codepoint plus any combining characters that follow it.
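A minimal sketch of the grapheme functions on the same kind of file name, assuming the intl extension (which provides `grapheme_strlen()` and `grapheme_substr()`) is installed:

```php
<?php
// Assumption: the intl extension is installed and enabled.
$name = "Volksza\u{0308}hlung"; // decomposed form: "a" + U+0308

// Counts what a reader sees: the "a" and its diaeresis are one unit.
echo grapheme_strlen($name), "\n"; // 12

// Taking the first 7 units keeps the combining mark attached to its base letter.
$prefix = grapheme_substr($name, 0, 7);
var_dump($prefix === "Volksza\u{0308}"); // bool(true)
```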

Finally, the PCRE regex functions support a \X escape sequence that matches whole grapheme clusters rather than individual codepoints.
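Here is a short sketch of `\X` in use. The `u` modifier tells PCRE to treat pattern and subject as UTF-8. The second check, which uses the `\p{M}` property class to flag names containing any combining mark, is an addition of mine and only one possible way to detect such file names.

```php
<?php
$name = "Volksza\u{0308}hlung"; // decomposed form: "a" + U+0308

// Split the name into grapheme clusters.
$count = preg_match_all('/\X/u', $name, $matches);
echo $count, "\n";                              // 12
var_dump($matches[0][6] === "a\u{0308}");       // bool(true): base letter plus mark

// Hypothetical detection step: does the name contain any combining mark?
$hasMark = preg_match('/\p{M}/u', $name) === 1;
var_dump($hasMark);                             // bool(true)
var_dump(preg_match('/\p{M}/u', "Volksz\u{00E4}hlung") === 1); // bool(false)
```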

Upvotes: 3
