Reputation: 1499
I have JSON files that are annotated with comments that I strip out before doing operations using jq. I just hit an interesting problem in which I received a JSON file with comment annotations that included some rich-text quote characters (hex 93 and hex 94). The dot (.) in my existing sed expression did not match these characters. Here is a demonstration:
First, the input:
% echo -e '# \x93text\x94\n{"a":1}' | od -c
0000000 # 223 t e x t 224 \n { " a " : 1 }
0000020 \n
0000021
%
And here is the transform:
% echo -e '# \x93text\x94\n{"a":1}' | sed 's/^\s*#.*//' | od -c
0000000 223 t e x t 224 \n { " a " : 1 } \n
0000017
%
Note that the dot character in the sed expression is not matching the hex 93 character. However, if I include LC_ALL=C:
% echo -e '# \x93text\x94\n{"a":1}' | LC_ALL=C sed 's/^\s*#.*//' | od -c
0000000 \n { " a " : 1 } \n
0000011
%
then the dot character in the sed expression does match the hex 93 and hex 94 characters. The sed documentation section Locale Considerations speaks of bracket expressions, but the behavior above seems to prove that this problem happens elsewhere.
It is interesting to note that deletion instead of substitution didn't show this problem:
% echo -e '# \x93text\x94\n{"a":1}' | sed '/^\s*#.*/d' | od -c
0000000 { " a " : 1 } \n
0000010
Given that I'm operating on annotated JSON files, I think the solution of adding LC_ALL=C to sed statements is reasonable.
So, my question: Is using LC_ALL=C something that I always want to use when doing non-locale-specific sed transformations (as would be applicable in annotated JSON files)? If not, what alternatives exist to avoid the problem I've shown above?
My environment:
Upvotes: 1
Views: 446
Reputation: 10039
The C locale is a special locale that is meant to be the simplest locale. You could also say that while the other locales are for humans, the C locale is for computers. In the C locale, characters are single bytes and the charset is ASCII.
On some systems, there's a difference with the POSIX locale where, for instance, the sort order for non-ASCII characters is not defined.
So LC_ALL=C is the safe way to make sed take those bytes with the 8th bit set into account.
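As a minimal sketch of how that could be applied to the comment stripping in the question: the function name `strip_comments` is made up for illustration, `[[:space:]]` stands in for the GNU-only `\s`, and `printf` with octal escapes (223 and 224 are hex 93 and 94) replaces `echo -e` for portability.

```shell
# Hypothetical helper (name is an assumption, not from the question):
# blank out '#' comment lines while treating the input as raw bytes.
strip_comments() {
  # The LC_ALL=C prefix applies only to this one sed invocation.
  LC_ALL=C sed 's/^[[:space:]]*#.*//'
}

# Same sample input as in the question.
out=$(printf '# \223text\224\n{"a":1}\n' | strip_comments)
printf '%s\n' "$out"
```

Setting the variable as a command prefix, rather than exporting it, keeps the rest of the pipeline (jq, for example) running in the user's normal locale.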
To see the difference, compare the following.
With LC_ALL=C, sed treats each such byte as a character, so [^[:alnum:]] matches 223 and 224:
echo -e '# \x93text\x94\n{"a":1}' | LC_ALL=C sed 's/[^[:alnum:]]/[HERE:&] /g' | od -c
0000000 [ H E R E : # ] [ H E R E :
0000020 ] [ H E R E : 223 ] t e x t [
0000040 H E R E : 224 ] \n [ H E R E : {
0000060 ] [ H E R E : " ] a [ H E R
0000100 E : " ] [ H E R E : : ] 1 [
0000120 H E R E : } ] \n
Without LC_ALL=C, sed does not treat those bytes as characters at all, so neither [[:alnum:]] nor [^[:alnum:]] matches them:
echo -e '# \x93text\x94\n{"a":1}' | sed 's/[[:alnum:]]/[HERE:&] /g' | od -c
0000000 # 223 [ H E R E : t ] [ H E R
0000020 E : e ] [ H E R E : x ] [ H
0000040 E R E : t ] 224 \n { " [ H E R E
0000060 : a ] " : [ H E R E : 1 ] }
0000100 \n
echo -e '# \x93text\x94\n{"a":1}' | sed 's/[^[:alnum:]]/[HERE:&] /g' | od -c
0000000 [ H E R E : # ] [ H E R E :
0000020 ] 223 t e x t 224 \n [ H E R E : {
0000040 ] [ H E R E : " ] a [ H E R
0000060 E : " ] [ H E R E : : ] 1 [
0000100 H E R E : } ] \n
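The question also notes that the d command was not affected. That is consistent with the outputs above: d removes the whole line as soon as the pattern matches anywhere in it, so the dot never has to consume the odd bytes. A possible alternative along those lines, assuming it is acceptable to drop comment lines entirely instead of leaving them blank (the function name `strip_comment_lines` is made up for illustration):

```shell
# Hypothetical helper (name is an assumption): delete whole comment lines.
# The pattern ends at '#', so no part of it has to match bytes 0x93/0x94.
strip_comment_lines() {
  sed '/^[[:space:]]*#/d'
}

out2=$(printf '# \223text\224\n{"a":1}\n' | strip_comment_lines)
printf '%s\n' "$out2"
```

Unlike the substitution, this changes the line count of the file, which matters if later error messages from jq are to be matched against line numbers in the original.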
Upvotes: 0