Reputation: 5661

Does . really match any character?

I am using a very simple sed script to remove comments: sed -e 's/--.*$//'

It works great until non-ascii characters are present in a comment, e.g.: -- °. This line does not match the regular expression and is not substituted.

Any idea how to get . to really match any character?

Solution:

Since file reports the input as ISO-8859 text, the LANG environment variable must be changed before calling sed: LANG=iso8859 sed -e 's/--.*//' -
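If no ISO-8859 locale is installed on the machine, one alternative that is not part of the solution above is the C locale, where sed works byte by byte and . matches any single byte except newline. A minimal sketch, assuming the input is a single-byte encoding; the function name strip_comments is made up for illustration:

```shell
#!/bin/sh
# strip_comments is a hypothetical helper name, not part of the original script.
# LC_ALL=C makes sed treat each byte as one character, so '.' cannot
# stumble over a byte sequence that is invalid in a multibyte locale.
strip_comments() {
    LC_ALL=C sed -e 's/--.*$//'
}

# \260 is the single byte that encodes the degree sign in ISO-8859-1.
printf 'SELECT 1 -- \260\n' | strip_comments
```

The text before the comment marker is kept as-is, including the space that precedes --.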

Upvotes: 13

Answers (3)

Victoria Stuart

Reputation: 5082

@julio-guerra: I ran into a similar situation, trying to delete lines like the following (note the Æ character):

--MP_/yZa.b._zhqt9OhfqzaÆC

in a file, using

sed 's/^--MP_.*$//g' my_file

The file encoding indicated by the Linux file command was

    file my_file: ISO-8859 text, with very long lines
 file -b my_file: ISO-8859 text, with very long lines
file -bi my_file: text/plain; charset=iso-8859-1

I tried your solution (clever!), with various permutations; e.g.,

LANG=ISO-8859 sed 's/^--MP_.*$//g' my_file

but none of those worked. I found two workarounds:

The following Perl expression worked, i.e. deleted that line:

perl -pe 's/^--MP_.*$//g' my_file

[For an explanation of the -pe command-line switches, refer to this StackOverflow answer: Perl flags -pe, -pi, -p, -w, -d, -i, -t?]

Alternatively, after converting the file encoding to UTF-8, the sed expression worked (the Æ character remained, but was now UTF8-encoded):

iconv -f iso-8859-1 -t utf-8 my_file > my_file.utf8
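The two steps can also be chained without a temporary file. This is only a sketch, assuming the input really is ISO-8859-1 and that sed runs in a UTF-8 (or C) locale; the function name blank_mp_lines is hypothetical:

```shell
#!/bin/sh
# blank_mp_lines is a hypothetical helper: it converts ISO-8859-1 input on
# stdin to UTF-8, then empties every line that starts with --MP_.
blank_mp_lines() {
    iconv -f iso-8859-1 -t utf-8 | sed 's/^--MP_.*$//'
}

# \306 is the ISO-8859-1 byte for the AE ligature; iconv turns it into valid UTF-8.
printf '%s\306C\nkeep me\n' '--MP_/yZa.b._zhqt9Ohfqza' | blank_mp_lines
```

If the input is not actually ISO-8859-1, iconv will still convert it without complaint, because every byte is valid in that encoding, so the result may be mojibake rather than an error.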

I am working with thousands of emails in various encodings that undergo intermediate processing, and bash-scripted conversions to UTF-8 do not always work, so for my purposes the first workaround above (Perl) will probably be the most robust.

Notes:

sed (GNU sed) 4.4
perl v5.26.1 built for x86_64-linux-thread-multi
Arch Linux x86_64 system

Upvotes: 4

Toby Speight

Reputation: 30880

The documentation of GNU sed's z command mentions this effect (my emphasis):

This command empties the content of pattern space. It is usually the same as 's/.*//', but is more efficient and works in the presence of invalid multibyte sequences in the input stream. POSIX mandates that such sequences are not matched by '.', so that there is no portable way to clear sed's buffers in the middle of the script in most multibyte locales (including UTF-8 locales).

It seems likely that you are running sed in a UTF-8 (or other multibyte) locale. You'll want to set LC_CTYPE (that's finer-grained than LANG, and won't affect translation of error messages). Valid locale names usually look like en_US.iso88591 or (for the location in your profile) fr_FR.iso88591, not just the encoding on its own. You might be able to see the full list with locale -a.

Example:

LC_CTYPE=fr_FR.iso88591 sed -e 's/--.*//'
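Since the exact locale name varies between systems, it may help to look it up rather than hard-code it. A rough sketch, assuming locale -a is available; pick_latin1_locale is a made-up helper name, and the pattern is a guess at the common spellings (iso88591, ISO8859-1, ISO-8859-1):

```shell
#!/bin/sh
# pick_latin1_locale is a hypothetical helper: it reads a list of locale
# names on stdin and prints the first one whose codeset is ISO-8859-1.
# The trailing $ keeps ISO-8859-15 entries from matching.
pick_latin1_locale() {
    grep -i -E 'iso-?8859-?1$' | head -n 1
}

# Look up a candidate among the installed locales (empty if none is found).
loc=$(locale -a 2>/dev/null | pick_latin1_locale)
echo "${loc:-no ISO-8859-1 locale installed}"
```

The printed name can then be used as LC_CTYPE=... in front of sed. Keep in mind that LC_ALL, if set in the environment, overrides LC_CTYPE.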

Alternatively, if you know that the non-comment parts of the line contain only ASCII, you could split the line at a comment marker, print the first part and discard the remainder:

sed -e 's/--/\n/' -e 'P' -e 'd'
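To see the split approach at work: the s command never needs . to match the odd byte, P prints only the text up to the inserted newline, and d discards the rest without printing it. A short demonstration, assuming GNU sed (other seds may not accept \n in the replacement); strip_after_marker is a hypothetical name:

```shell
#!/bin/sh
# strip_after_marker is a hypothetical helper wrapping the command above.
# Lines without a marker contain no newline, so P prints them whole.
strip_after_marker() {
    sed -e 's/--/\n/' -e 'P' -e 'd'
}

# \260 is the ISO-8859-1 byte for the degree sign, invalid on its own in UTF-8.
printf 'SELECT 1 -- \260\nSELECT 2\n' | strip_after_marker
```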

Upvotes: 0

Anonymoose

Reputation: 5982

It works for me. It's probably a character encoding problem.

This might help:

Upvotes: 5
