Steve Amerige

Reputation: 1499

Should LC_ALL=C Always be used for Non-Locale-Specific sed Operations?

I have JSON files that are annotated with comments, which I strip out before processing the files with jq. I just hit an interesting problem: I received a JSON file whose comment annotations included some rich-text quote characters (hex 93 and hex 94). The dot (.) in my existing sed expression did not match these characters. Here is a demonstration:

First, the input:

% echo -e '# \x93text\x94\n{"a":1}' | od -c
0000000   #     223   t   e   x   t 224  \n   {   "   a   "   :   1   }
0000020  \n
0000021
%

And here is the transform:

% echo -e '# \x93text\x94\n{"a":1}' | sed 's/^\s*#.*//' | od -c
0000000 223   t   e   x   t 224  \n   {   "   a   "   :   1   }  \n
0000017
%

Note that the dot character in the sed expression is not matching the hex 93 character. However, if I include LC_ALL=C:

% echo -e '# \x93text\x94\n{"a":1}' | LC_ALL=C sed 's/^\s*#.*//' | od -c
0000000  \n   {   "   a   "   :   1   }  \n
0000011
%

then the dot character in the sed expression does match the hex 93 and hex 94 characters. The sed documentation section Locale Considerations speaks of bracket expressions, but the behavior above seems to prove that this problem happens elsewhere.

It is interesting to note that deletion instead of substitution didn't show this problem:

% echo -e '# \x93text\x94\n{"a":1}' | sed '/^\s*#.*/d' | od -c
0000000   {   "   a   "   :   1   }  \n
0000010

Given that I'm operating on annotated JSON files, I think the solution of adding LC_ALL=C to sed statements is reasonable.

So, my question: Is using LC_ALL=C something that I always want to use when doing non-locale-specific sed transformations (as would be applicable in annotated JSON files)? If not, what alternatives exist to avoid the problem I've shown above?

My environment:

Upvotes: 1

Views: 446

Answers (1)

NeronLeVelu

Reputation: 10039

The C locale is a special locale that is meant to be the simplest locale. You could also say that while the other locales are for humans, the C locale is for computers. In the C locale, characters are single bytes and the charset is ASCII.

On some systems, there is a difference from the POSIX locale, where, for instance, the sort order for non-ASCII characters is not defined.

So LC_ALL=C is the safe way to make sed take bytes with the 8th bit set into account.
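As a minimal sketch of how this applies to the comment-stripping case from the question, the LC_ALL=C assignment can be scoped to the single sed call so that the rest of the script keeps its locale. The function name strip_comments is hypothetical, and the pattern assumes that comments are whole lines starting with #:

```shell
# Hypothetical helper: delete whole-line "#" comments, working byte-wise.
# The LC_ALL=C prefix applies only to this one sed invocation.
strip_comments() {
    LC_ALL=C sed '/^[[:space:]]*#/d'
}

# Octal \223 and \224 are the hex 93 and hex 94 bytes from the question.
printf '# \223text\224\n{"a":1}\n' | strip_comments
```

The remaining lines can then be piped to jq as before.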

Compare the following runs.

With LC_ALL=C, sed counts each 8th-bit byte as a character, so [^[:alnum:]] matches it:

echo -e '# \x93text\x94\n{"a":1}' | LC_ALL=C sed 's/[^[:alnum:]]/[HERE:&] /g' | od -c
0000000   [   H   E   R   E   :   #   ]       [   H   E   R   E   :
0000020   ]       [   H   E   R   E   : 223   ]       t   e   x   t   [
0000040   H   E   R   E   : 224   ]      \n   [   H   E   R   E   :   {
0000060   ]       [   H   E   R   E   :   "   ]       a   [   H   E   R
0000100   E   :   "   ]       [   H   E   R   E   :   :   ]       1   [
0000120   H   E   R   E   :   }   ]      \n

Without LC_ALL=C, sed does not count the 8th-bit bytes as characters at all: neither [[:alnum:]] nor [^[:alnum:]] matches them.

echo -e '# \x93text\x94\n{"a":1}' | sed 's/[[:alnum:]]/[HERE:&] /g' | od -c
0000000   #     223   [   H   E   R   E   :   t   ]       [   H   E   R
0000020   E   :   e   ]       [   H   E   R   E   :   x   ]       [   H
0000040   E   R   E   :   t   ]     224  \n   {   "   [   H   E   R   E
0000060   :   a   ]       "   :   [   H   E   R   E   :   1   ]       }
0000100  \n

echo -e '# \x93text\x94\n{"a":1}' | sed 's/[^[:alnum:]]/[HERE:&] /g' | od -c
0000000   [   H   E   R   E   :   #   ]       [   H   E   R   E   :
0000020   ]     223   t   e   x   t 224  \n   [   H   E   R   E   :   {
0000040   ]       [   H   E   R   E   :   "   ]       a   [   H   E   R
0000060   E   :   "   ]       [   H   E   R   E   :   :   ]       1   [
0000100   H   E   R   E   :   }   ]      \n
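If the stray bytes are known to come from one specific legacy encoding, a possible alternative to forcing the C locale is to convert the input to UTF-8 first, so that sed in a UTF-8 locale sees valid characters. This is only a sketch: it assumes the input is Windows-1252 (where hex 93 and hex 94 are the curly double quotes), which the question does not confirm, and that the local iconv recognises that encoding name. The function name cp1252_strip is hypothetical.

```shell
# Assumption: the input is Windows-1252, so 0x93/0x94 are curly quotes.
# After conversion they are valid UTF-8 sequences that "." can match.
cp1252_strip() {
    iconv -f WINDOWS-1252 -t UTF-8 | sed 's/^[[:space:]]*#.*//'
}

printf '# \223text\224\n{"a":1}\n' | cp1252_strip
```

If the real encoding is unknown or mixed, the byte-wise LC_ALL=C approach makes no such assumption.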

Upvotes: 0
