Find non-ASCII text in a file

Question

I am trying to use grep to find the Greek word μάθηση in a file. Written as Unicode escapes, the word is \u03bc\u03ac\u03b8\u03b7\u03c3\u03b7. I tried this command:

grep -r $"\u03bc\u03ac\u03b8\u03b7\u03c3\u03b7" filename.txt

but it failed. Any help?

Walter Tross · Accepted Answer

This works on my Mac with zsh:

fgrep "$(echo '\u03bc\u03ac\u03b8\u03b7\u03c3\u03b7')" filename.txt

The following works on my Mac with bash 3.2.57 (for those who don't know: Apple switched to zsh rather than upgrading to bash version 4, because of licensing concerns):

fgrep "$(echo -e '\xce\xbc\xce\xac\xce\xb8\xce\xb7\xcf\x83\xce\xb7')" filename.txt

The builtin version of echo in bash (which you can read about with man bash, not with man echo) needs the -e option to expand certain escape sequences (\x in this case), but \u (Unicode) is not among these. I don't know whether this is different in newer versions of bash.
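As a sketch of a way around the differences between echo implementations: printf with octal escapes is specified by POSIX, so it should behave the same in bash 3.2, zsh and other POSIX shells. The octal values below are the same UTF-8 bytes as the \x sequence above; the temporary directory and the sample file contents are made up for illustration, and grep -F is the standard spelling of fgrep.

```shell
# Build the pattern from octal escapes (UTF-8 bytes of μάθηση).
pattern=$(printf '\316\274\316\254\316\270\316\267\317\203\316\267')

# Hypothetical sample file, created only so the example is self-contained.
tmpdir=$(mktemp -d)
printf 'machine learning\nμηχανική μάθηση\n' > "$tmpdir/filename.txt"

# -F treats the pattern as a fixed string, like fgrep.
matches=$(grep -F -- "$pattern" "$tmpdir/filename.txt")
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

As far as I know, bash 4.2 and later also accept \u in echo -e and printf, but the octal form does not depend on the shell version.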

To find the UTF-8 hex representation of the search string I did an od -tx1 of a text file where I had written μάθηση. Of course, here I'm supposing your file is UTF-8-encoded.
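A minimal sketch of that lookup without a separate text file, assuming the script itself is saved as UTF-8. The variable names are only illustrative; od -An suppresses the offset column, and the sed step turns the hex dump into the \x form used above.

```shell
# Word to look up; the bytes seen by od are whatever encoding this script uses.
word='μάθηση'

# Hex dump of the bytes, with the offset column, spaces and newlines removed.
hex=$(printf '%s' "$word" | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$hex"

# Prefix every byte (two hex digits) with \x to get an escape sequence.
escaped=$(printf '%s' "$hex" | sed 's/../\\x&/g')
printf '%s\n' "$escaped"
```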

The following should always work, though (*):

Write μάθηση in a 1-line file, say it's called grepfile.txt, then

fgrep -f grepfile.txt filename.txt

(tested on Mac with bash and zsh)
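The whole pattern-file approach can be sketched end to end as follows. The temporary directory and the sample contents of filename.txt are assumptions made so the example runs on its own; grep -F -f is equivalent to fgrep -f, and -n is added only to show the matching line number.

```shell
workdir=$(mktemp -d)

# One-line pattern file containing the search word.
printf '%s\n' 'μάθηση' > "$workdir/grepfile.txt"

# Hypothetical file to search.
printf 'first line\nβαθιά μάθηση\nlast line\n' > "$workdir/filename.txt"

# Read fixed-string patterns from grepfile.txt and print matches with line numbers.
found=$(grep -F -n -f "$workdir/grepfile.txt" "$workdir/filename.txt")
printf '%s\n' "$found"

rm -rf "$workdir"
```

Because the pattern never passes through echo or any escape processing, this form does not depend on which shell is used.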

(*): This solution should work as long as the encoding of both files is the same (you can check the encoding with the file command, keeping in mind that 7-bit ASCII is a subset of UTF-8, but also of all ISO-8859-* encodings).
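To illustrate why the encodings have to match, here is a small sketch in which the same word is stored once as UTF-8 and once as ISO-8859-7 (written with octal escapes for the bytes ec dc e8 e7 f3 e7). The file names are hypothetical, and the script is assumed to be saved as UTF-8, so the pattern itself is UTF-8.

```shell
encdir=$(mktemp -d)

# The word μάθηση in UTF-8 and in ISO-8859-7.
printf '%s\n' 'μάθηση' > "$encdir/utf8.txt"
printf '\354\334\350\347\363\347\n' > "$encdir/greek.txt"

# Same UTF-8 pattern against both files.
if grep -F -q 'μάθηση' "$encdir/utf8.txt"; then utf8_result=match; else utf8_result=nomatch; fi
if grep -F -q 'μάθηση' "$encdir/greek.txt"; then greek_result=match; else greek_result=nomatch; fi

printf 'UTF-8 file: %s\nISO-8859-7 file: %s\n' "$utf8_result" "$greek_result"

rm -rf "$encdir"
```

The second search finds nothing because the bytes on disk differ, even though both files show the same word in an editor that knows their encoding.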
