Reputation: 6015
I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:
grep -e "[\x{00FF}-\x{FFFF}]" file.xml
But this returns every line in the file, regardless of whether the line contains a character in the range specified.
Do I have the syntax wrong or am I doing something else wrong? I've also tried:
egrep "[\x{00FF}-\x{FFFF}]" file.xml
(with both single and double quotes surrounding the pattern).
Upvotes: 444
Views: 372655
Reputation: 2068
LC_ALL=C rg -v '[[:ascii:]]'
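The -v (or --invert-match) option prints the lines that do not match, so this shows only lines that contain no ASCII characters at all. Pipe to rg . or to rg -v "^$" to remove empty lines.
Here is how to install it.
Maybe I'm missing something, but I found this the easiest and fastest alternative.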
Upvotes: 0
Reputation: 17822
You can use the command:
LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" file.xml
This will give you the line number, and will highlight non-ascii chars in red.
On some systems, depending on your settings, the above will not work, so you can grep for the inverse instead:
LC_ALL=C grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
Note also that the important bit is the -P flag, which equates to --perl-regexp, so grep will interpret your pattern as a Perl regular expression. The grep manual also says that this is highly experimental and that grep -P may warn of unimplemented features.
Upvotes: 597
Reputation: 107899
The easy way is to define a non-ASCII character... as a character that is not an ASCII character.
LC_ALL=C grep '[^ -~]' file.xml
The code above looks for characters that are not printable ASCII characters: non-ASCII characters and control characters. Add a tab after the ^ if there might be tabs in the file. Add a carriage return if there might be Windows line endings that you don't want to match. In bash or zsh, you can use $'…' quoting with \t for a tab and \r for a carriage return.
LC_ALL=C grep $'[^\t\r -~]' file.xml
With other shells that don't support $'…', you can insert control characters literally when working interactively, e.g. with Ctrl+V Ctrl+M. In a script, you might prefer not to include the control character literally in the script, and instead generate it at runtime.
control_characters=$(printf '\t\r')
LC_ALL=C grep "[^${control_characters} -~]" file.xml
To avoid matching any control character, use the range of ASCII characters (excluding null, which can't be specified on the command line). With GNU grep, control characters normally result in the message “binary file matches” instead of printing out matches; pass the --text option to display control characters in the output.
LC_ALL=C grep --text $'[^\001-~]' file.xml
Setting LC_COLLATE=C avoids nasty surprises about the meaning of character ranges in many locales. Setting LC_CTYPE=C is necessary to match single-byte characters; otherwise the command would miss invalid byte sequences in the current encoding. Setting LC_ALL=C avoids locale-dependent effects altogether.
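As a rough illustration of the bracket expressions above, the following sketch builds a small throwaway file and compares the strict pattern with the tab-tolerant one. The helper names and the sample contents are made up for this example, and it uses printf rather than $'…' quoting so that it should also run in a plain sh.

```shell
# Sketch only: helper names and sample data are assumptions, not part of the answer above.

# Print the comma-separated numbers of the lines that contain
# anything outside printable ASCII (space through tilde).
strict_lines() {
    LC_ALL=C grep -n '[^ -~]' "$1" | cut -d: -f1 | paste -sd, -
}

# Same search, but a literal tab (generated at runtime) is also allowed.
lenient_lines() {
    tab=$(printf '\t')
    LC_ALL=C grep -n "[^${tab} -~]" "$1" | cut -d: -f1 | paste -sd, -
}

sample=$(mktemp)
# Line 1: plain ASCII. Line 2: "caf" plus the UTF-8 bytes C3 A9. Line 3: contains a tab.
printf 'plain ascii\ncaf\303\251\ntab\there\n' > "$sample"

echo "strict:  $(strict_lines "$sample")"
echo "lenient: $(lenient_lines "$sample")"
rm -f "$sample"
```

With this sample, the strict pattern should report lines 2 and 3, while the tab-tolerant one should report only line 2.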
Upvotes: 78
Reputation: 10900
This works for me:
Command:
LC_ALL=C grep --color='auto' -obnP "[\x80-\xFF]" file.xml
Output:
868:31879:�
868:106287:�
868:106934:�
868:107349:�
868:254456:�
868:254678:�
868:286403:�
870:315585:�
870:389741:�
870:390388:�
870:390803:�
870:537910:�
870:538132:�
870:569811:�
870:598916:�
870:673324:�
870:673971:�
870:674386:�
870:821493:�
870:821715:�
870:853440:�
871:882578:�
871:956734:�
871:957381:�
871:957796:�
871:1104903:�
871:1105125:�
871:1136804:�
Command:
# Split each grep output line on ':'. Print the first two fields as-is and pipe the third (the matched byte) through xxd to show it as hex
LC_ALL=C grep --color='auto' -obnP "[\x80-\xFF]" file.xml |\
xargs -I{} bash -c "echo {}|awk 'BEGIN { FS = \":\" };{printf \"%s:%s:\",\$1, \$2; print \$3 | \"xxd -p -l1\" }'"
Output:
868:31879:96
868:106287:92
868:106934:92
868:107349:92
868:254456:92
868:254678:92
868:286403:92
870:315585:96
870:389741:92
870:390388:92
870:390803:92
870:537910:92
870:538132:92
870:569811:92
870:598916:96
870:673324:92
870:673971:92
870:674386:92
870:821493:92
870:821715:92
870:853440:92
871:882578:96
871:956734:92
871:957381:92
871:957796:92
871:1104903:92
871:1105125:92
871:1136804:92
Notes:
Without LC_ALL=C in front, it does not work on Ubuntu 22.04.
-o: print only the matched part, -b: byte offset of the match, -n: line number of the match, -P: Perl-compatible regular expressions.
Upvotes: 0
Reputation: 285
This method should work with any POSIX-compliant version of awk and iconv. We can take advantage of file and tr as well.
Solutions above may be better in some cases, but they seem to depend on GNU/Linux implementations or additional tools.
Just get a sample file somehow:
$ curl -LOs http://gutenberg.org/files/84/84-0.txt
$ file 84-0.txt
84-0.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
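Search for lines containing non-ASCII bytes (which is what the bytes of multi-byte UTF-8 characters are):
$ awk '/[\x80-\xFF]/ { print }' 84-0.txt
or for non-ASCII via a character class (not POSIX after all, see possible solution below):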
$ awk '/[^[:ascii:]]/ { print }' 84-0.txt
Convert UTF-8 to ASCII, removing problematic characters (including BOM which should not be in UTF-8 anyway):
$ iconv -c -t ASCII 84-0.txt > 84-ascii.txt
Check it:
$ file 84-ascii.txt
84-ascii.txt: ASCII text, with CRLF line terminators
Tweak it to remove DOS line endings / ^M ("CRLF line terminators"):
$ tr -d '\015' < 84-ascii.txt > 84-tweaked.txt && file 84-tweaked.txt
84-tweaked.txt: ASCII text
This method discards any "bad" characters it cannot deal with, so you may need to sanitize / validate the output. YMMV
>> UPDATE << I have been using something closer to this lately:
$ LC_ALL=C tr -d '[:print:]' < 84-0.txt | fold -w 1 | sort -u | sed -n l
I am not sure how portable it is, but it gives me the option to automate swapping out characters or strings.
I do not have quick access to a real UNIX right now, but I think those are all POSIX-compliant options and switches. I do know it is pretty fast. YMMV.
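A related check, sketched here as a variation on the tr approach rather than as part of the method above, is to count how many non-ASCII bytes a file contains before and after conversion. The function name and the sample data are assumptions made for this illustration.

```shell
# Sketch only: the function name and sample data are assumptions for illustration.

# Count the bytes in a file that fall outside the 7-bit ASCII range (octal 000-177).
# tr deletes every ASCII byte, wc counts what is left, and the final tr strips
# the padding that some wc implementations add.
count_non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

sample=$(mktemp)
# "caf" followed by the bytes C3 A9, a CRLF line ending, then a plain ASCII line.
printf 'caf\303\251\r\nplain\n' > "$sample"
count_non_ascii_bytes "$sample"
rm -f "$sample"
```

A result of 0 would suggest that the file is pure ASCII; the sample here contains two such bytes.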
Upvotes: 1
Reputation: 2915
nawk '/[\200-\377]/'
mawk '/[\200-\377]/'
gawk -b '/[\200-\377]/'
gawk -e '!/^[\0-\177]*$/'
In gawk's unicode mode, just doing /[^\0-\177]/ is insufficient because it misses all the poorly-formed sequences and/or arbitrary bytes like \371.
Otherwise, you have to list all 128 bytes out in alternation form, and it's hideous.
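As a small sketch of the byte-range approach, the function below pins the locale with LC_ALL=C instead of relying on an implementation-specific flag. That substitution is an assumption on my part about how the awk in use treats the C locale; the function name and sample input are also made up for illustration.

```shell
# Sketch only: the function name and sample input are assumptions for illustration.

# Print "line number: line" for every input line containing a byte in octal 200-377.
# LC_ALL=C asks awk to treat the input as single bytes rather than multibyte characters.
non_ascii_lines() {
    LC_ALL=C awk '/[\200-\377]/ { print NR ": " $0 }'
}

# Line 1 is plain ASCII; line 2 ends with the UTF-8 bytes C3 A9.
printf 'plain\ncaf\303\251\n' | non_ascii_lines
```

Only the second line of the sample should be reported.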
Upvotes: 0
Reputation: 18998
Strangely, I had to do this today! I ended up using Perl because I couldn't get grep/egrep to work (even in -P mode). Something like:
cat blah | perl -ne '/\xCA\xFE\xBA\xBE/ && print "found"'
For unicode characters (like \u2212 in the example below) use this:
find . ... -exec perl -CA -e '$ARGV = @ARGV[0]; open IN, $ARGV; binmode(IN, ":utf8"); binmode(STDOUT, ":utf8"); while (<IN>) { next unless /\N{U+2212}/; print "$ARGV: $&: $_"; exit }' '{}' \;
Upvotes: 1
Reputation: 329
Finding all non-ascii characters gives the impression that one is either looking for unicode strings or intends to strip said characters individually.
For the former, try one of these (the variable file is used for automation):
file=file.txt ; LC_ALL=C grep -Piao '[\x80-\xFF\x20]{7,}' $file | iconv -f $(uchardet $file) -t utf-8
file=file.txt ; pcregrep -iao '[\x80-\xFF\x20]{7,}' $file | iconv -f $(uchardet $file) -t utf-8
file=file.txt ; pcregrep -iao '[^\x00-\x1F\x21-\x7F]{7,}' $file | iconv -f $(uchardet $file) -t utf-8
Vanilla grep doesn't work correctly without LC_ALL=C, as noted in the previous answers.
The ASCII range is x00-x7F and space is x20; since strings have spaces, the negative range omits it.
The non-ASCII range is x80-xFF; since strings have spaces, the positive range adds it.
A string is presumed to be at least 7 consecutive characters within the range: {7,}.
For shell-readable output, uchardet $file returns a guess of the file encoding, which is passed to iconv for automatic conversion.
Upvotes: 1
Reputation: 51
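It can also be useful to know how to search for a single unicode character. This command can help; you only need to know the character's code point. Note that -v prints the lines that do not contain the character; drop -v to print the lines that do.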
grep -v $'\u200d'
Upvotes: 1
Reputation: 101
The following code works:
find /tmp | perl -ne 'print if /[^[:ascii:]]/'
Replace /tmp with the name of the directory you want to search through. Note that this checks the path names printed by find, not the contents of the files.
Upvotes: 10
Reputation: 4398
Searching for non-printable chars. TL;DR, executive summary: LC_ALL=C is needed to make grep do what you might expect with extended unicode.
SO the preferred non-ascii char finders:
$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test
as in top answer, the inverse grep:
$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test
as in top answer but WITH LC_ALL=C:
$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
. . more . . excruciating detail on this: . . .
I agree with Harvey above, buried in the comments: it is often more useful to search for non-printable characters, and it is easy to think non-ASCII when you really should be thinking non-printable. Harvey suggests: "use this: "[^\n -~]". Add \r for DOS text files. That translates to "[^\x0A\x20-\x7E]", and add \x0D for CR."
Also, adding -c (show count of patterns matched) to grep is useful when searching for non-printable chars as the strings matched can mess up terminal.
I found that adding the ranges 0x00-0x08 and 0x0e-0x1f (to the 0x80-0xff range) gives a useful pattern. This leaves out TAB, LF and CR and two more uncommon printable control chars (VT and FF), so they are not reported. So IMHO a quite useful (albeit crude) grep pattern is THIS one:
grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *
ACTUALLY, generally you will need to do this:
LC_ALL=C grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *
breakdown:
LC_ALL=C - set locale to C, otherwise many extended chars will not match (even though they look like they are encoded > 0x80)
\x00-\x08 - non-printable control chars 0 - 8 decimal
\x0E-\x1F - more non-printable control chars 14 - 31 decimal
\x80-\xFF - non-printable (non-ASCII) bytes 128 - 255 decimal
-c - print count of matching lines instead of lines
-P - perl style regexps
Instead of -c you may prefer to use -n (and optionally -b) or -l
-n, --line-number
-b, --byte-offset
-l, --files-with-matches
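To make the difference between -c and -l concrete, here is a minimal sketch on two throwaway files. The function names and file names are assumptions for illustration, and the simpler POSIX pattern [^ -~] stands in for the -P pattern so that the sketch does not depend on PCRE support.

```shell
# Sketch only: function names and sample files are assumptions for illustration.
# The POSIX pattern [^ -~] is used here in place of the -P pattern.

# -c: number of matching lines in one file.
count_matching_lines() {
    LC_ALL=C grep -c '[^ -~]' "$1"
}

# -l: names of the *.txt files in a directory that contain at least one match.
files_with_matches() {
    (cd "$1" && LC_ALL=C grep -l '[^ -~]' ./*.txt)
}

dir=$(mktemp -d)
printf 'only ascii\n' > "$dir/clean.txt"
# Two lines with UTF-8 bytes (C3 A9 and C3 AF) and one plain line.
printf 'caf\303\251\nna\303\257ve\nplain\n' > "$dir/dirty.txt"

echo "dirty.txt: $(count_matching_lines "$dir/dirty.txt") matching lines"
files_with_matches "$dir"
rm -rf "$dir"
```

Here -c should report two matching lines for dirty.txt, and -l should list only ./dirty.txt.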
For example, using find to grep all files under the current directory:
LC_ALL=C find . -type f -exec grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" {} +
You may wish to adjust the grep at times. e.g. BS(0x08 - backspace) char used in some printable files or to exclude VT(0x0B - vertical tab). The BEL(0x07) and ESC(0x1B) chars can also be deemed printable in some cases.
Non-Printable ASCII Chars
** marks PRINTABLE but CONTROL chars that are sometimes useful to exclude
Dec  Hex  Ctrl  Char description              Dec  Hex  Ctrl  Char description
0    00   ^@    NULL                          16   10   ^P    DATA LINK ESCAPE (DLE)
1    01   ^A    START OF HEADING (SOH)        17   11   ^Q    DEVICE CONTROL 1 (DC1)
2    02   ^B    START OF TEXT (STX)           18   12   ^R    DEVICE CONTROL 2 (DC2)
3    03   ^C    END OF TEXT (ETX)             19   13   ^S    DEVICE CONTROL 3 (DC3)
4    04   ^D    END OF TRANSMISSION (EOT)     20   14   ^T    DEVICE CONTROL 4 (DC4)
5    05   ^E    END OF QUERY (ENQ)            21   15   ^U    NEGATIVE ACKNOWLEDGEMENT (NAK)
6    06   ^F    ACKNOWLEDGE (ACK)             22   16   ^V    SYNCHRONIZE (SYN)
7    07   ^G    BEEP (BEL)                    23   17   ^W    END OF TRANSMISSION BLOCK (ETB)
8    08   ^H    BACKSPACE (BS)**              24   18   ^X    CANCEL (CAN)
9    09   ^I    HORIZONTAL TAB (HT)**         25   19   ^Y    END OF MEDIUM (EM)
10   0A   ^J    LINE FEED (LF)**              26   1A   ^Z    SUBSTITUTE (SUB)
11   0B   ^K    VERTICAL TAB (VT)**           27   1B   ^[    ESCAPE (ESC)
12   0C   ^L    FF (FORM FEED)**              28   1C   ^\    FILE SEPARATOR (FS) RIGHT ARROW
13   0D   ^M    CR (CARRIAGE RETURN)**        29   1D   ^]    GROUP SEPARATOR (GS) LEFT ARROW
14   0E   ^N    SO (SHIFT OUT)                30   1E   ^^    RECORD SEPARATOR (RS) UP ARROW
15   0F   ^O    SI (SHIFT IN)                 31   1F   ^_    UNIT SEPARATOR (US) DOWN ARROW
UPDATE: I had to revisit this recently. And, YMMV depending on terminal settings/solar weather forecast, BUT I noticed that grep was not finding many unicode or extended characters. Even though intuitively they should match the range 0x80 to 0xff, 3 and 4 byte unicode characters were not matched. Can anyone explain this? YES. @frabjous asked and @calandoa explained that LC_ALL=C should be used to set the locale for the command to make grep match.
e.g. my locale, with LC_ALL= empty:
$ locale
LANG=en_IE.UTF-8
LC_CTYPE="en_IE.UTF-8"
.
.
LC_ALL=
grep with LC_ALL= empty matches 2 byte encoded chars but not 3 and 4 byte encoded:
$ grep -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" notes_unicode_emoji_test
5:© copyright c2a9
7:call underscore c2a0
9:CTRL
31:5 © copyright
32:7 call underscore
grep with LC_ALL=C does seem to match all extended characters that you would want:
$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
1:���� unicode dashes e28090
3:��� Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5:� copyright c2a9
7:call� underscore c2a0
11:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29:1 ���� unicode dashes
30:3 ��� Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31:5 � copyright
32:7 call� underscore
33:11 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other
34:52 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other
81:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other
THIS perl match (partially found elsewhere on stackoverflow) OR the inverse grep on the top answer DO seem to find ALL the ~weird~ and ~wonderful~ "non-ascii" characters without setting locale:
$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test
$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test
1 ‐‐ unicode dashes e28090
3 💘 Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5 © copyright c2a9
7 call underscore c2a0
9 CTRL-H CHARS URK URK URK
11 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29 1 ‐‐ unicode dashes
30 3 💘 Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31 5 © copyright
32 7 call underscore
33 11 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other
34 52 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other
73 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other
SO the preferred non-ascii char finders:
$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test
as in top answer, the inverse grep:
$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test
as in top answer but WITH LC_ALL=C:
$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
Upvotes: 22
Reputation: 1945
In perl
perl -ane '{ if(m/[[:^ascii:]]/) { print } }' fileName > newFile
Upvotes: 62
Reputation: 14746
The following works for me:
grep -P "[\x80-\xFF]" file.xml
Non-ASCII characters start at 0x80 and go to 0xFF when looking at bytes. Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want. The -P option in my grep allows the use of \xdd escapes in character classes to accomplish what you want.
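To see why a byte range of 0x80 to 0xFF catches multi-byte characters, it can help to look at the raw bytes. The sketch below dumps a few characters as hex with od; the helper name is an assumption for illustration, and the characters are written as octal escapes so the example does not depend on the terminal encoding.

```shell
# Sketch only: the helper name is an assumption for illustration.

# Print the bytes read from standard input as one contiguous lowercase hex string.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

printf 'A' | hex_bytes; echo             # 41: plain ASCII, below 0x80
printf '\302\251' | hex_bytes; echo      # c2a9: the copyright sign encoded as UTF-8
printf '\342\200\220' | hex_bytes; echo  # e28090: U+2010 HYPHEN encoded as UTF-8
```

Every byte of the two UTF-8 encoded characters is 0x80 or above, which is why matching on that byte range finds them.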
Upvotes: 68
Reputation: 4798
Instead of making assumptions about the byte range of non-ASCII characters, as most of the above solutions do, it's slightly better IMO to be explicit about the actual byte range of ASCII characters instead.
So the first solution for instance would become:
grep --color='auto' -P -n '[^\x00-\x7F]' file.xml
(which basically greps for any character outside of the hexadecimal ASCII range: from \x00 up to \x7F)
On Mountain Lion that won't work (due to the lack of PCRE support in BSD grep), but with pcre installed via Homebrew, the following will work just as well:
pcregrep --color='auto' -n '[^\x00-\x7F]' file.xml
Any pros or cons that anyone can think of?
Upvotes: 158
Reputation: 2915
UPDATE 1: changed the main awk code from 9 to NF to handle leading and trailing-edge ASCII.
Keep it simple with awk: leverage RS for hands-free driving, no locale adjustments required:
__=$'123=pqr:\303\606?\414#45&6\360\641\266\666]>^{(\13xyz'
printf '%s' "$__" | od
0000000 1026765361 980578672 205489859 641020963
1 2 3 = p q r : Æ ** ? \f # 4 5 &
061 062 063 075 160 161 162 072 303 206 077 014 043 064 065 046
1 2 3 = p q r : ? 86 ? ff # 4 5 &
49 50 51 61 112 113 114 58 195 134 63 12 35 52 53 38
31 32 33 3d 70 71 72 3a c3 86 3f 0c 23 34 35 26
0000020 3064066102 1581145526 2013997179 31353
6 𡶶 ** ** ** ] > ^ { ( \v x y z
066 360 241 266 266 135 076 136 173 050 013 170 171 172
6 ? ? ? ? ] > ^ { ( vt x y z
54 240 161 182 182 93 62 94 123 40 11 120 121 122
36 f0 a1 b6 b6 5d 3e 5e 7b 28 0b 78 79 7a
0000036
printf '%s' "$__"
123=pqr:Æ?
#45&6𡶶]>^{(
xyz
mawk NF RS='[\0-\577]+' | gcat -b
1 Æ
2 𡶶
Set a custom ORS for single-line output:
gawk NF RS='[\0-\577]+' ORS='|' | gcat -b
Æ|𡶶|
If you insist on using nawk, then you need to modify the RS to ...
nawk NF RS='(\\0|[\1-\177]+)+'
... since nawk has issues handling either \0 or \\0 within a char class, it must be taken out of [...] and be replaced with a disturbingly verbose alternation
Upvotes: 0
Reputation: 2268
Here is another variant I found that produced completely different results from the grep search for [\x80-\xFF] in the accepted answer. Perhaps it will be useful to someone to find additional non-ascii characters:
grep --color='auto' -P -n "[^[:ascii:]]" myfile.txt
Note: my computer's grep (a Mac) did not have the -P option, so I did brew install grep and started the call above with ggrep instead of grep.
Upvotes: 31