Reputation: 2932

grep and utf-8 encoded umlauts

I am running Ubuntu and echo $LANG tells me that I am using UTF-8: "en_US.UTF-8".

I created a directory with one file called 'ö' (a German umlaut):

ronald@lala:~/tempX/test$ ls
ö

My understanding is that, because of the UTF-8 encoding, the file name consists of two bytes representing one character. Therefore I am surprised that this matches:

ronald@lala:~/tempX/test$ ls | grep "^\W\W$"
ö
ronald@lala:~/tempX/test$ ls | egrep "^\W{2,}$"
ö
ronald@lala:~/tempX/test$ ls | grep -P "^\W{2,}$"
ö
ronald@lala:~/tempX/test$ ls | pcregrep "^\W{2,}$"
ö

Why is grep regarding 'ö' as two non-word-characters and not just one?

Best regards, Ronald

Upvotes: 6

Answers (4)

Mark G.

Reputation: 2968

Short answer:

The proper locale files need to be present in addition to having the proper environment variable(s) set before grep can correctly interpret non-ASCII text. Run locale-gen en_US.UTF-8 followed by export LANG="en_US.UTF-8" and you should be good-to-go. If that doesn’t work (or if you don’t have locale-gen installed) try export LANG=C.UTF-8.

Long answer:

Example of the problem:

$ O_WITH_UMLAUT="ö"

$ printf "%s" "$O_WITH_UMLAUT" | grep -E '^[^\w]$'

$ printf "%s" "$O_WITH_UMLAUT" | grep -E '^[^\w]{2}$'
ö

The first attempt produces no output, but as soon as you ask grep to search for TWO non-word-characters in a row, there it is…

This behavior occurs because non-ASCII characters use a multi-byte encoding scheme (which should almost always be UTF-8 in this day and age, but ancient/obsolescent systems may be using a more exotic encoding).

$ printf "%s" "$O_WITH_UMLAUT" | od -Ax -tx1
000000 c3 b6
000002

Note: If your terminal emulator doesn’t allow you to paste an 'ö' because of related encoding issues, then you can still place one into an environment variable like this to test things out: O_WITH_UMLAUT=$(printf "\xC3\xB6")
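To make the byte-versus-character distinction concrete, here is a minimal Python 3 sketch (not grep itself, so treat it as an analogy). Matching against the raw bytes stands in for a tool that does not know the input is UTF-8; matching against the decoded string stands in for a tool that does.

```python
import re

name = "ö"                   # U+00F6, a single character
raw = name.encode("utf-8")   # the two bytes shown by od above

print(len(name), len(raw))   # 1 2

# Byte-oriented matching: each of the two bytes counts as its own
# non-word "character", so a two-element pattern matches.
print(bool(re.fullmatch(rb"\W\W", raw)))    # True

# Character-oriented matching: there is only one character, so a
# two-element pattern cannot match.
print(bool(re.fullmatch(r"\W\W", name)))    # False
```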

The usual recommendation to address this issue is to set the LANG environment variable (which acts as a fallback for the LC_* environment variables) to something like en_US.UTF-8 (or en_GB.UTF-8, pl_PL.UTF-8, ru_RU.UTF-8, C.UTF-8, and so on) so that grep knows what encoding to expect in its input data:

$ export LANG="en_US.UTF-8"

…However, what if that doesn’t work?

$ printf "%s" "$O_WITH_UMLAUT" | grep -E '^[^\w]$'

$ printf "%s" "$O_WITH_UMLAUT" | grep -E '^[^\w]{2}$'
ö

In that case, the first thing to check is the output of locale:

$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Looks like some of the locale files are missing.

The first paragraph of the locale-gen manpage explains why:

Compiled locale files take about 50MB of disk space, and most users only need few locales. In order to save disk space, compiled locale files are not distributed in the locales package, but selected locales are automatically generated when this package is installed by running the locale-gen program.

So, all we have to do is:

$ locale-gen en_US.UTF-8
Generating locales (this might take a while)...
  en_US.UTF-8... done

$ locale  # no more warnings!
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

$ printf "%s" "$O_WITH_UMLAUT" | grep -E '^[^\w]$'  # works as it should!
ö
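If you would rather test for a missing locale from a script than read the warnings printed by locale, something like the following Python 3 sketch may help. The helper name locale_available is made up for this example, and the result depends on your C library: glibc rejects locales that have not been generated, while other C libraries may be more permissive.

```python
import locale

def locale_available(name):
    """Return True if the C library accepts `name` for LC_CTYPE.

    Hypothetical helper for illustration only.
    """
    saved = locale.setlocale(locale.LC_CTYPE)      # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)    # raises locale.Error if refused
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)   # restore the previous setting

for candidate in ("en_US.UTF-8", "C.UTF-8", "C"):
    print(candidate, locale_available(candidate))
```

The "C" locale should always be reported as available; the other two depend on what has been generated or shipped on the machine.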

…However, what if that doesn’t work?

$ locale-gen en_US.UTF-8
bash: locale-gen: command not found

In desperation, you can try C.UTF-8, which should be readily available almost anywhere:

$ export LANG="C.UTF-8"

$ printf "%s" "$O_WITH_UMLAUT" | grep -E '^[^\w]$'
ö

If that still doesn’t work, you can try setting LC_ALL (which acts as a heavy-handed override) instead of LANG (which, as mentioned earlier, acts as merely a fallback).
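The override/fallback relationship can be sketched as a small lookup. This is only an illustration of the POSIX precedence rule (LC_ALL first, then the category-specific variable, then LANG); effective_locale is a hypothetical helper, not something the C library or grep exposes.

```python
def effective_locale(category, env):
    """Return the locale name that applies to `category` (e.g. "LC_CTYPE").

    Hypothetical helper: an empty value is treated the same as an unset one.
    """
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"   # nothing set: fall back to the default "C"/POSIX locale

# LANG alone is used as-is.
print(effective_locale("LC_CTYPE", {"LANG": "en_US.UTF-8"}))
# A category-specific variable beats LANG.
print(effective_locale("LC_CTYPE", {"LANG": "en_US.UTF-8", "LC_CTYPE": "C"}))
# LC_ALL beats everything.
print(effective_locale("LC_CTYPE", {"LANG": "C", "LC_CTYPE": "C", "LC_ALL": "C.UTF-8"}))
```

Passing os.environ as env shows what your current shell would hand to grep.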

Final Addendum:

In your case, the non-ASCII data isn’t coming from an environment variable but from a directory on the filesystem (or, to be more specific, from ls’s chosen textual representation of that directory’s name). Be aware that some filesystems (or their APIs, or tools like ls) store or produce names differently than you may expect, which can cause similar (but unrelated) issues.

For example, consider the following, performed on a Linux system:

$ mkdir -p /tmp/dirs
$ cd /tmp/dirs
$ python -i

>>> import os
>>> os.getcwd()
'/tmp/dirs'
>>> os.listdir('.')
[]
>>> # Create a directory with this name:
>>> # U+00F6: LATIN SMALL LETTER O WITH DIAERESIS
>>> # (total Unicode code-points: 1)
>>> os.makedirs('\xc3\xb6')
>>> os.listdir('.')
['\xc3\xb6']
>>> # Now create a directory with *this* name:
>>> # U+006F: LATIN SMALL LETTER O (ASCII)
>>> # followed by U+0308: COMBINING DIAERESIS (non-ASCII modifier)
>>> # (total Unicode code-points: 2)
>>> os.makedirs('o\xcc\x88')
>>> os.listdir('.')
['\xc3\xb6', 'o\xcc\x88']
>>> exit()

$ ls | grep -E '^[^\w]$'
ö

$ ls | grep -E '^[^\w]{2}$'
ö

$ ls -Fl
total 8
drwxr-xr-x 2 docker docker 4096 May 15 20:52 ö/
drwxr-xr-x 2 docker docker 4096 May 15 20:51 ö/

(How’s that for confusing?!)
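The two look-alike names are the composed (NFC) and decomposed (NFD) forms of the same text. A short Python 3 sketch using the standard unicodedata module shows how they relate; note that the session above uses Python 2 byte strings, whereas this uses Python 3 text strings.

```python
import unicodedata

composed = "\u00f6"       # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"    # LATIN SMALL LETTER O + COMBINING DIAERESIS

print(composed == decomposed)            # False: different code points
print(len(composed), len(decomposed))    # 1 2
print(composed.encode("utf-8"))          # b'\xc3\xb6'
print(decomposed.encode("utf-8"))        # b'o\xcc\x88'

# Normalizing converts one form into the other.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```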

And now, the same thing on a Mac OS X (HFS+) system, which thankfully disallows such shenanigans, but at the expense of your files/directories perhaps not being represented in quite the way you might expect:

>>> import os
>>> os.getcwd()
'/private/tmp/dirs'
>>> os.listdir('.')
[]
>>> os.makedirs('\xc3\xb6')
>>> os.listdir('.')
['o\xcc\x88']  # ...that's not what we asked it to create...
>>> os.makedirs('o\xcc\x88')
OSError: [Errno 17] File exists: 'o\xcc\x88'
>>> os.makedirs('\xc3\xb6')
OSError: [Errno 17] File exists: '\xc3\xb6'
>>> exit()

$ ls | grep -E '^[^\w]$'  # nothing...

$ ls | grep -E '^[^\w]{2}$'  # there it is.
ö

So, once you’re sure your locale is set up and functioning properly, if your regexes still aren’t working, the next thing to check is that your filesystem (or your build of ls, or whatever other utility you’re using in your grep pipeline) isn’t transcoding your data behind the scenes. (I could weave a yarn about MinGW/MSYS utilities and NTFS/exFAT that would bore about as much hair out of your head as I pulled out of my own during that particular escapade… but I digress.)
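One rough way to spot such transcoding is to list the names that are not in composed (NFC) form, since those are the ones a single-character pattern will miss. This is a Python 3 sketch under that assumption; non_nfc_names is a hypothetical helper, and a name being decomposed does not by itself prove that the filesystem changed it.

```python
import unicodedata

def non_nfc_names(names):
    """Return the names that change when normalized to NFC.

    Hypothetical helper for illustration only.
    """
    return [n for n in names if unicodedata.normalize("NFC", n) != n]

# Sample input: one composed name, one decomposed name, one plain ASCII name.
sample = ["\u00f6", "o\u0308", "plain"]
print([ascii(n) for n in non_nfc_names(sample)])   # ["'o\\u0308'"]
```

In practice you could pass it os.listdir('.') instead of the sample list.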

Hope that helps!

Upvotes: 5

OrangeFish

Reputation: 32

Yes, you are right, @Ronald: something is wrong with grep and Unicode. According to man grep:

The symbol \w is a synonym for [_[:alnum:]] and \W is a synonym for [^_[:alnum:]].

But this synonym does not work.

LANG=ru_RU.UTF-8
$ echo Ю | egrep \w
(nothing)
$ echo Ю | egrep [_[:alnum:]]
Ю
$ echo Ю | egrep '\W\W'
Ю
$ egrep -V
egrep (GNU grep) 2.16

Upvotes: 0

Karol S

Reputation: 9457

grep works at the character level and takes the encoding and collation of your current locale into account (this is documented in the man pages). You can force it to use ASCII by switching to the C locale.

Using pl_PL.UTF-8:

$ echo Ź | grep -i ź
Ź
$ echo ó | grep '[a-z]'
ó
$ echo ó | grep '^..$'
(nothing)

Using C:

$ echo Ź | LC_ALL=C grep -i ź
(nothing)
$ echo ó | LC_ALL=C grep '[a-z]'
(nothing)
$ echo ó | LC_ALL=C grep '^..$'
ó
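The same contrast can be reproduced outside grep. In this Python 3 sketch, text patterns play the role of the UTF-8 locale and byte patterns play the role of the C locale; it is an analogy rather than a description of how grep is implemented, and it leaves out the '[a-z]' case because Python ranges do not follow locale collation.

```python
import re

upper, lower = "Ź", "ź"   # U+0179 and U+017A

# Character level: case-insensitive matching knows the two letters are a pair.
print(bool(re.fullmatch(lower, upper, re.IGNORECASE)))            # True

# Byte level: only ASCII letters are case-folded, so the bytes differ.
print(bool(re.fullmatch(lower.encode("utf-8"),
                        upper.encode("utf-8"), re.IGNORECASE)))   # False

# 'ó' is one character but two bytes, so '..' only matches the bytes.
print(bool(re.fullmatch(r"..", "ó")))                             # False
print(bool(re.fullmatch(rb"..", "ó".encode("utf-8"))))            # True
```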

Upvotes: 2

Mark Setchell

Reputation: 208043

Unconventional "answer", but my answer is that your Ubuntu is broken, or you need to use the same locale as me! I am using OS X Mavericks.

ls ??
<nothing>

ls ?
¨

ls ?| xxd
0000000: c2a8 0a                                  ...

ls | grep "^\W\W$"
<nothing>

ls | grep "^\W$"
¨

echo $LANG
en_GB.UTF-8