skylla

Reputation: 464

Turbo Grep - find special characters in UTF-8 file

I am running Windows 7 and (have to) use Turbo Grep (from Borland) to search in a file. I have two versions of this file, one encoded in UTF-8 and one in ANSI.

If I run the following grep on the ANSI file, I get the expected results, but I get no results with the same statement on the UTF-8 file:

grep -ni "[äöü]" myfile.txt

[-n for line numbers, -i for ignoring case]

The Turbo Grep version is:

Turbo GREP 5.6 Copyright (c) 1992-2010 Embarcadero Technologies, Inc.
Syntax:  GREP [-rlcnvidzewoqhu] searchstring file[s] or @filelist
         GREP ? for help

Help for this command lists:

Options are one or more option characters preceded by "-", and optionally followed by "+" (turn option on), or "-" (turn it off). The default is "+".

-r+  Regular expression search
-l-  File names only
-c-  match Count only
-n-  Line numbers
-v-  Non-matching lines only
-i-  Ignore case
-d-  Search subdirectories
-z-  Verbose
-e   Next argument is searchstring
-w-  Word search
     Default set: [0-9A-Z_]
-o-  UNIX output format
-q-  Quiet: supress normal output
-h-  Supress display of filename
-u xxx  Create a copy of grep named 'xxx' with current options set as default

A regular expression is one or more occurrences of: One or more characters optionally enclosed in quotes. The following symbols are treated specially:

^  start of line
$  end of line
.  any character
\  quote next character
*  match zero or more
+  match one or more
[aeiou0-9]   match a, e, i, o, u, and 0 thru 9
[^aeiou0-9]  match anything but a, e, i, o, u, and 0 thru 9

Is there a problem with the encoding of these characters in UTF-8? Might there be a problem with Turbo Grep and UTF-8?

Thanks in advance

Upvotes: 0

Views: 1200

Answers (1)

Victor Carrera

Reputation: 101

Yes, there is a difference. Windows 7 uses UTF-16 little endian internally, not UTF-8; UTF-8 is the usual encoding on Unix, Linux and Plan 9, to name a few operating systems.

Jon Skeet explains:

ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default code page for my system" which is obtained via Encoding.Default, and is often Windows-1252

UTF-8: Variable length encoding, 1-4 bytes covers every current character. ASCII values are encoded as ASCII.
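To make the difference concrete, here is a small Python sketch (Python is only used for illustration, it is not part of the question) that prints the bytes of ä, ö and ü in Windows-1252 and in UTF-8. It assumes that "ANSI" means Windows-1252 on the asker's system, which the question does not confirm.

```python
# Assumption: the "ANSI" file uses Windows-1252 (cp1252).
CHARS = "\u00e4\u00f6\u00fc"  # the characters ä, ö, ü


def hex_bytes(text, encoding):
    """Return a list with the hex byte sequence of each character."""
    return [ch.encode(encoding).hex() for ch in text]


ansi = hex_bytes(CHARS, "cp1252")
utf8 = hex_bytes(CHARS, "utf-8")
print("cp1252:", ansi)  # cp1252: ['e4', 'f6', 'fc']
print("utf-8: ", utf8)  # utf-8:  ['c3a4', 'c3b6', 'c3bc']

# A byte-wise search for the single cp1252 bytes finds nothing in the UTF-8 data.
utf8_data = CHARS.encode("utf-8")
found = any(bytes.fromhex(b) in utf8_data for b in ansi)
print(found)  # False
```

None of the single bytes e4, f6 or fc occurs in the UTF-8 data, so a search that compares byte by byte has nothing to match.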

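For the search, the difference is in the bytes: in an ANSI code page such as Windows-1252, ä, ö and ü are each a single byte, which is why the search works on the ANSI file. In UTF-8 each of them is a two-byte sequence, so a tool that compares single bytes will not find them.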

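If you use only ASCII characters, both encodings are usable, since the bytes are identical. With special characters such as ä, ö and ü, the encoding matters: Windows tools typically expect ANSI or UTF-16, while UTF-8 is the norm on the other systems.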
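If Turbo Grep has to be used, one possible workaround is to search an ANSI copy of the UTF-8 file. The following is a minimal sketch, assuming the target code page is Windows-1252; the function name utf8_to_ansi and the output file name used below are made up for illustration.

```python
def utf8_to_ansi(src_path, dst_path, ansi_encoding="cp1252"):
    """Write a copy of a UTF-8 text file re-encoded in a single-byte code page.

    Hypothetical helper; cp1252 is an assumption about the local ANSI code page.
    """
    # "utf-8-sig" also drops a leading byte order mark if the file has one.
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src:
        text = src.read()
    # errors="replace" writes "?" for characters the code page cannot represent.
    with open(dst_path, "w", encoding=ansi_encoding, errors="replace", newline="") as dst:
        dst.write(text)
```

After calling utf8_to_ansi("myfile.txt", "myfile_ansi.txt"), the original command can be run against myfile_ansi.txt. Line numbers stay the same because line endings are preserved, but any character outside the code page shows up as "?" in the copy.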

Upvotes: 1
