Reputation: 5661
I am using a very simple sed script to remove comments: sed -e 's/--.*$//'
It works great until non-ASCII characters are present in a comment, e.g.: -- °
Such a line does not match the regular expression and is not substituted.
Any idea how to get . to really match any character?
Solution:
Since the file command says the input is ISO-8859 text, the LANG environment variable must be changed before calling sed:
LANG=iso8859 sed -e 's/--.*//' -
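A minimal sketch of a related variant, assuming GNU sed: the C locale is single-byte and always available, so forcing it with LC_ALL (which takes precedence over both LANG and LC_CTYPE) makes . match every byte, whatever the file's encoding. The helper name strip_comments is hypothetical and not part of the original solution.

```shell
# strip_comments: hypothetical helper, reads stdin and removes "--" comments.
# LC_ALL=C makes sed treat each byte as one character, so '.' also
# matches bytes that are invalid in a multibyte locale such as UTF-8.
strip_comments() {
    LC_ALL=C sed -e 's/--.*$//'
}

# \260 is the ISO-8859-1 byte for the degree sign.
printf 'x = 1 -- \260\n' | strip_comments
```

The trade-off is that in the C locale sed no longer treats multibyte characters as single units, which is harmless here because the pattern only contains ASCII.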
Upvotes: 13
Views: 29028
Reputation: 5082
@julio-guerra: I ran into a similar situation, trying to delete lines like the following (note the Æ character):
--MP_/yZa.b._zhqt9OhfqzaÆC
in a file, using
sed 's/^--MP_.*$//g' my_file
The file encoding indicated by the Linux file command was:
file my_file: ISO-8859 text, with very long lines
file -b my_file: ISO-8859 text, with very long lines
file -bi my_file: text/plain; charset=iso-8859-1
I tried your solution (clever!), with various permutations; e.g.,
LANG=ISO-8859 sed 's/^--MP_.*$//g' my_file
but none of those worked. I found two workarounds:
1. This Perl expression worked, i.e. deleted that line:
perl -pe 's/^--MP_.*$//g' my_file
[For an explanation of the -pe command-line switches, refer to this StackOverflow answer: Perl flags -pe, -pi, -p, -w, -d, -i, -t?]
2. Converting the file to UTF-8 with iconv (the Æ character remained, but was now UTF-8-encoded):
iconv -f iso-8859-1 -t utf-8 my_file > my_file.utf8
As I am working with thousands of emails in various encodings that undergo intermediate processing (bash-scripted conversions to UTF-8 do not always work), "solution 1" above will probably be the most robust one for my purposes.
Upvotes: 4
Reputation: 30880
The documentation of GNU sed's z command mentions this effect (my emphasis):
This command empties the content of pattern space. It is usually the same as 's/.*//', but is more efficient and works in the presence of invalid multibyte sequences in the input stream. POSIX mandates that such sequences are not matched by '.', so that there is no portable way to clear sed's buffers in the middle of the script in most multibyte locales (including UTF-8 locales).
It seems likely that you are running sed in a UTF-8 (or other multibyte) locale. You'll want to set LC_CTYPE (that's finer-grained than LANG, and won't affect translation of error messages). Valid locale names usually look like en.iso88591 or (for the location in your profile) fr_FR.iso88591, not just the encoding on its own; you might be able to see the full list with locale -a.
Example:
LC_CTYPE=fr_FR.iso88591 sed -e 's/--.*//'
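Since the locale has to be installed for this to take effect, here is a small sketch that picks one from the output of locale -a instead of hard-coding it. The helper name pick_latin1_locale and the matching pattern are assumptions; the exact spelling of locale names varies between systems.

```shell
# pick_latin1_locale: hypothetical helper. Reads locale names on stdin
# (one per line, as printed by `locale -a`) and prints the first one
# whose name ends in an ISO-8859-1 charset, e.g. fr_FR.iso88591.
pick_latin1_locale() {
    grep -i 'iso-*8859-*1$' | head -n 1
}

loc=$(locale -a 2>/dev/null | pick_latin1_locale) || loc=''
if [ -n "$loc" ]; then
    # \260 is the ISO-8859-1 byte for the degree sign.
    printf 'x = 1 -- \260\n' | LC_CTYPE="$loc" sed -e 's/--.*//'
else
    echo "no ISO-8859-1 locale installed" >&2
fi
```

If LC_ALL is set in the environment it takes precedence over LC_CTYPE, so it may need to be unset or overridden as well.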
Alternatively, if you know that the non-comment parts of the line contain only ASCII, you could split the line at a comment marker, print the first part and discard the remainder:
sed -e 's/--/\n/' -e 'P' -e 'd'
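To illustrate how the three commands cooperate, here is the same script wrapped in a filter with each step annotated. It assumes GNU sed, which accepts \n in the replacement text; other sed implementations may not. The helper name strip_comments_split is hypothetical.

```shell
# strip_comments_split: hypothetical helper, reads stdin.
#   s/--/\n/  replace the first "--" with a newline (GNU sed extension)
#   P         print the pattern space up to the first newline
#   d         discard the pattern space, including any unmatched bytes
strip_comments_split() {
    sed -e 's/--/\n/' -e 'P' -e 'd'
}

# \260 is the ISO-8859-1 byte for the degree sign.
printf 'x = 1 -- \260\nno comment here\n' | strip_comments_split
```

Lines without a comment marker contain no newline after the substitution, so P prints them whole before d discards them. Because the comment text is never matched by ., its encoding does not matter.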
Upvotes: 0
Reputation: 5982
It works for me. It's probably a character encoding problem.
This might help:
Upvotes: 5