user1264579

Reputation: 75

Will sed (and others) corrupt non-ASCII files?

If I write scripts that manipulate files, for example doing search/replace with sed, and the files can be in various charsets, can the files be corrupted?

The text I wish to replace is ASCII and only occurs on lines that contain nothing but ASCII, while the remaining lines contain characters in other charsets.

Upvotes: 3

Views: 1075

Answers (1)

Michał Kosmulski

Reputation: 10020

If your files use single-byte encodings (like the ISO-8859-n family) or UTF-8, where the newline character is encoded the same way as in ASCII and the NUL character (\0) doesn't occur, your operation is likely to work. If the files use UTF-16, it will not (because of the NULs). The reason it should work for a simple search and replace of ASCII strings is that we assumed your encoding is a superset of ASCII, and for a simple match like this sed mostly works at the byte level and just replaces one byte sequence with another.
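To make the byte-level view concrete, here is a minimal sketch. The helper names (hex, latin2_sample, utf16le_sample, replace_g) are made up for illustration, and the sample bytes are written out by hand with printf octal escapes so that no particular locale or converter needs to be installed. It assumes a sed that passes bytes above 0x7F through unchanged in the C locale, which is how GNU sed behaves.

```shell
# Hypothetical helpers for illustration; not part of the original answer.

# Print a byte stream as one continuous lowercase hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# "gęś" plus newline encoded in ISO-8859-2: g=0x67, ę=0xEA, ś=0xB6.
latin2_sample() { printf 'g\352\266\n'; }

# "foo" plus newline encoded in UTF-16LE (no BOM): every ASCII letter,
# and the newline itself, is followed by a NUL byte.
utf16le_sample() { printf 'f\000o\000o\000\n\000'; }

# Literal ASCII replacement; LC_ALL=C makes sed treat each byte as one character.
replace_g() { LC_ALL=C sed -e 's/g/G/g'; }

latin2_sample | hex; echo                # 67eab60a
latin2_sample | replace_g | hex; echo    # 47eab60a  (only the 'g' byte changed)
utf16le_sample | hex; echo               # 66006f006f000a00
```

In the ISO-8859-2 sample only the byte for g changes and the accented bytes pass through untouched. The UTF-16LE dump shows the interleaved NULs and the two-byte newline that make the same approach unsuitable there.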

However, with more complex operations, such as when the search or replacement strings contain special (for example accented) characters, your results may vary. For example, the accented characters you type on the command line might not match the encoding used in your file if the console encoding/locale differs from the file encoding. This can be worked around, but it requires care.
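The encoding mismatch can be sketched as follows. The helper names are again hypothetical, and the example assumes a UTF-8 terminal and an ISO-8859-2 file: the "ę" typed on the command line reaches sed as the two bytes 0xC4 0x99, while the file stores it as the single byte 0xEA, so the pattern never matches.

```shell
# Hypothetical helpers for illustration; not part of the original answer.

# Print a byte stream as one continuous lowercase hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# A line of an ISO-8859-2 file containing "gęś": g=0x67, ę=0xEA, ś=0xB6.
latin2_line() { printf 'g\352\266\n'; }

# "ę" as a UTF-8 terminal would pass it to sed (0xC4 0x99) versus the
# single byte that actually appears in the ISO-8859-2 file (0xEA).
e_utf8=$(printf '\304\231')
e_latin2=$(printf '\352')

# Replace the given pattern with ASCII "e", working on raw bytes.
replace_with_e() { LC_ALL=C sed -e "s/$1/e/g"; }

latin2_line | replace_with_e "$e_utf8"   | hex; echo   # 67eab60a  (no match, line unchanged)
latin2_line | replace_with_e "$e_latin2" | hex; echo   # 6765b60a  (ę replaced by e)
```

One way around the mismatch, as the second call suggests, is to give sed the pattern in the file's own encoding rather than in the terminal's.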

Some operations in sed depend on your locale, for example which characters are considered alphanumeric. Compare the following replacement performed in a Polish UTF-8 locale and in the C locale, which uses ASCII:

$ echo "gęś gęgała" | LC_ALL=pl_PL.UTF-8 sed -e 's/[[:alnum:]]/X/g'
XXX XXXXXX
$ echo "gęś gęgała" | LC_ALL=C sed -e 's/[[:alnum:]]/X/g'
Xęś XęXXłX

But if you only want to replace literal strings, it works as expected:

$ echo "gęś gęgała" | LC_ALL=pl_PL.UTF-8 sed -e 's/g/G/g'
Gęś GęGała
$ echo "gęś gęgała" | LC_ALL=C sed -e 's/g/G/g'
Gęś GęGała

As you can see, the results of the [[:alnum:]] replacement differ because accented characters are treated differently depending on the locale, while the literal replacement gives the same output in both locales. In short: replacing literal ASCII strings will most probably work fine; more complex operations need a closer look and may or may not work.

Upvotes: 5
