Nolias

Reputation: 111

Bash: replace 4-digit hex codes in a string with the characters they encode

I have strings:

^[U0422^Z ^[U041D^Z^[U0410^Z ^[U0412^Z^[U042B^Z^[U0417^Z === Т НА ВЫЗ

and so on. I want to run sed on these strings to replace each ^[Uxxxx^Z code with the character it encodes.

How can I do this when sed only accepts 2-digit hex codes? I have 3 GB of data with characters encoded like this. I need to do this in a script, because I have multiple files and 152 characters to decode.

Upvotes: 1

Views: 384

Answers (1)

l'L'l

Reputation: 47159

You can use Perl. Here's an example:

file.txt:

Żelazna ręka Marsa - J^[U00F8^Zrstad, Jarl. ^[U0422^Z ^[U041D^Z^[U0410^Z ^[U0412^Z^[U042B^Z^[U0417^Z

script.pl

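#!/usr/bin/perl

use strict;
use warnings;

# Usage: ./script.pl <input> <output>
open my $in,  '<:encoding(UTF-8)', $ARGV[0] or die $!;
open my $out, '>:encoding(UTF-8)', $ARGV[1] or die $!;

while (<$in>) {
    # Replace each ^[Uxxxx^Z escape with the character at code point xxxx.
    s/\^\[U([0-9A-Fa-f]{4})\^Z/chr(hex($1))/ge;
    print $out $_;
}

close $in;
close $out;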

Syntax is ./script.pl <input> <output>.

Contents of the output file (out.txt is just an example name):

$ ./script.pl file.txt out.txt
$ cat out.txt
Żelazna ręka Marsa - Jørstad, Jarl. Т НА ВЫЗ

Version that processes every .txt file in the current directory (it does not descend into subdirectories):

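#!/usr/bin/perl

use strict;
use warnings;

# Usage: ./script.pl <prefix>
die "Usage: $0 <prefix>\n" unless @ARGV;
my $prefix = $ARGV[0];

# Process every .txt file in the current directory.
for my $file (glob '*.txt') {
    open my $in,  '<:encoding(UTF-8)', $file or die $!;
    open my $out, '>:encoding(UTF-8)', $prefix . "_" . $file or die $!;

    while (<$in>) {
        # Replace each ^[Uxxxx^Z escape with the character at code point xxxx.
        s/\^\[U([0-9A-Fa-f]{4})\^Z/chr(hex($1))/ge;
        print $out $_;
    }

    close $in;
    close $out;
}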

Syntax is ./script.pl <prefix>. If data.txt is found, the new file is named <prefix>_data.txt.

Upvotes: 1
