Reputation: 153
GNU sed version 4.1.5 seems to fail with international characters. Here is my input file:
Gras Och Stenar Trad - From Moja to Minneapolis DVD [G2007DVD] 7812 | X
Gras Och Stenar Trad - From Möja to Minneapolis DVD [G2007DVD] 7812 | Y
(Note the umlaut in the second line.)
And when I do
sed 's/.*| //' < in
I would expect to see only the X and Y, as I've asked to remove ALL chars up to the '|' and the space beyond it. Instead, I get:
X
Gras Och Stenar Trad - From M? Y
I know I can use tr to remove the international characters first, but is there a way to do this with sed alone?
Upvotes: 15
Views: 17763
Reputation:
sed is not very well set up for non-ASCII text. However, you can use (almost) the same code in perl and get the result you want:
perl -pe 's/.*\| //' x
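If you would rather stay with sed, one commonly used workaround is to run it in the C locale, so that every byte counts as one character and . can match the byte that encodes the umlaut. The following is only a sketch: it assumes the file is ISO-8859-1 (where ö is the single byte \366 in octal) and uses shortened sample lines and a temporary file invented for the illustration.

```shell
# Build a two-line sample; \366 is the ISO-8859-1 byte for "ö" (assumed encoding).
sample=$(mktemp)
printf 'From Moja to Minneapolis DVD 7812 | X\n' > "$sample"
printf 'From M\366ja to Minneapolis DVD 7812 | Y\n' >> "$sample"

# LC_ALL=C makes sed work byte by byte, so .* can cross the \366 byte.
out=$(LC_ALL=C sed 's/.*| //' < "$sample")
printf '%s\n' "$out"

# Clean up the temporary sample file.
rm -f "$sample"
```

This should leave just X and Y. The trade-off is that in the C locale sed no longer knows about multibyte characters at all, which is fine for this substitution but could matter for patterns that count or classify characters.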
Upvotes: 12
Reputation: 86492
I think the error occurs if the input encoding of the file is different from the preferred encoding of your environment.
Example: the file in is UTF-8
$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Y
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y
UTF-8 can safely be interpreted as ISO-8859-1: you'll get strange characters, but apart from that everything is fine.
Example: the file in is ISO-8859-1
$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Gras Och Stenar Trad - From MöY
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y
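ISO-8859-1 cannot be interpreted as UTF-8: decoding the input file fails, because the single byte that encodes ö in ISO-8859-1 is not a valid UTF-8 sequence. The strange match is probably due to sed trying to recover rather than fail completely: the . apparently does not match the invalid byte, so .* can only begin matching after it.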
The answer is based on Debian Lenny/Sid and sed 4.1.5.
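If that diagnosis is right, one way out is to convert the file to the encoding your locale expects before sed sees it. A minimal sketch, assuming the file is ISO-8859-1, the environment is UTF-8, and iconv is installed; the shortened sample lines and the temporary file are made up for the illustration:

```shell
# Sample input; \366 is the ISO-8859-1 byte for "ö" (assumed input encoding).
sample=$(mktemp)
printf 'From Moja | X\nFrom M\366ja | Y\n' > "$sample"

# Re-encode to UTF-8 first, so sed only ever sees valid multibyte text.
out=$(iconv -f ISO-8859-1 -t UTF-8 < "$sample" | sed 's/.*| //')
printf '%s\n' "$out"

# Clean up the temporary sample file.
rm -f "$sample"
```

With the conversion in place the substitution should give X and Y, as in the matching-encoding cases above.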
Upvotes: 25