Jay Gray

Reputation: 1726

Using awk, how to replace one string with another?

I need to create an awk script that converts a glyph (https://en.wikipedia.org/wiki/Glyph) to its Unicode escape (JavaScript syntax, e.g. \u00C0), and the reverse: a Unicode escape back to a glyph.

The source data is edited in Notepad++ and saved with UTF-8 encoding.

Here's my progress.

Use_case_1

Dictionary file (dict_1_.txt):

A \u0041
À \u00C0

Input file (input_1_.txt):

A
À

awk script for generating Unicode for equivalent glyph:

awk 'NR == FNR { a[$1] = $2; next } $1 in a { $1 = a[$1] } $2 in a { $2 = a[$2] } 1' dict_1_.txt input_1_.txt

correctly producing:

\u0041
\u00C0

Use_case_2

Dictionary file (dict_2_.txt)

\u0041 A
\u00C0 À

Input file (input_2_.txt)

\u0041
\u00C0

awk script for generating glyphs for equivalent Unicode:

awk 'NR == FNR { a[$1] = $2; next } $1 in a { $1 = a[$1] } $2 in a { $2 = a[$2] } 1' dict_2_.txt input_2_.txt

correctly producing:

A
À

So I can successfully "round-trip" a single symbol per line.

But how do I deal with a more comprehensive dictionary and more than one word per row?

Here is sample data.

Input file (input_3_.txt)

PUDÍN, ALMIDÓN

Dictionary file (dict_3_.txt)

,   \u002C
A   \u0041
D   \u0044
I   \u0049
Í   \u00CD
L   \u004C
M   \u004D
N   \u006E
Ó   \u00D3
P   \u0050
U   \u0055
<space> \u0020

The awk script should generate:

\u0050\u0055\u0044\u00CD\u006E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u006E

Input file (input_4_.txt)

\u0050\u0055\u0044\u00CD\u006E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u006E

Dictionary file (dict_4_.txt)

\u002C  ,
\u0041  A
\u0044  D
\u0049  I
\u00CD  Í
\u004C  L
\u004D  M
\u006E  N
\u00D3  Ó
\u0050  P
\u0055  U
\u0020  <space>

The awk script should generate:

PUDÍN, ALMIDÓN

Here is a more complicated set of input strings (one per row):

MONO Y DIACETIL ÉSTERES DEL ÁCIDO TARTÁRICO DE MONO Y DIGLICÉRIDOS DE ÁCIDOS GRASOS AÑADIDOS
043 HUEVAS DE PESCADO (INCLUYENDO ESPERMA=HUEVAS BLANDAS) Y VÍSCERAS COMESTIBLES DE PESCADO
ACEITE DE SOJA OXIDADO TÉRMICAMENTE Y EN INTERACCIÓN CON MONO Y DIGLICÉRIDOS DE ÁCIDOS GRASOS
BANDEJA PLÁSTICA O CAZUELA, CUBIERTA DE PAPEL DE ALUMINIO O ENVOLTURA

In the Dictionary examples above, I have used <space> to stand for the space character that appears between words and after a comma. This probably means that a solution should use \t as FS in both the Dictionary file and the Input file. Currently the FS is an ordinary keyboard space, and the RS is \n.

Further, I need to do the same for hexadecimal, so a solution needs to process a Dictionary file like this:

Í   &#xcd;
Ó   &#xd3;

as compared to the Dictionary example above:

Í   \u00CD
Ó   \u00D3

How can I improve or replace my simple awk scripts with scripts that process longer strings on multiple lines?

Upvotes: 0

Views: 587

Answers (1)

karakfa

Reputation: 67467

Here is one approach. Note that you don't need two different versions of the dictionary.

With a little effort the two scripts below can be combined into one, with the conversion direction controlled by a parameter. I intentionally kept the dictionary-loading part identical in both.

$ awk 'NR==FNR {$2=$2?$2:" "; u2a[$1]=$2; a2u[$2]=$1; next}
               {for(i=1;i<=NF;i++) $i=a2u[$i]}1' dict FS='' OFS='' input

\u0050\u0055\u0044\u00CD\u006E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u006E
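
The command above depends on an empty FS splitting the record into one character per field, which POSIX leaves unspecified. As a rough sketch of an alternative, the same encoding can be done by scanning each line from left to right and trying the longest dictionary key first. The helper name glyph_to_escape and the tab-separated, glyph-first dictionary layout (as suggested in the question) are assumptions for illustration, not part of the original scripts.

```shell
# Sketch only: glyph_to_escape is a hypothetical helper name, and the
# dictionary layout (glyph<TAB>escape, one pair per line) is an assumption.
glyph_to_escape() {
    # $1 = dictionary file, $2 = input file
    awk -F '\t' '
        NR == FNR {                       # first file: load glyph -> escape
            map[$1] = $2
            if (length($1) > maxlen) maxlen = length($1)
            next
        }
        {                                 # second file: scan left to right
            out = ""
            i = 1
            while (i <= length($0)) {
                hit = 0
                # longest candidate first, so multi-byte glyphs match whole
                for (n = maxlen; n >= 1 && !hit; n--) {
                    key = substr($0, i, n)
                    if (key in map) {
                        out = out map[key]
                        i += length(key)
                        hit = 1
                    }
                }
                # no dictionary entry: copy one unit through unchanged
                if (!hit) {
                    out = out substr($0, i, 1)
                    i++
                }
            }
            print out
        }
    ' "$1" "$2"
}
```

It would be called as glyph_to_escape dict.tsv input_3_.txt, where dict.tsv is a hypothetical tab-separated copy of dict_3_.txt with a real space character in place of <space>. Because keys are compared as whole strings, the scan should behave the same whether the awk in use counts bytes or characters.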

Working in the other direction, from the encoded input:

$ awk 'NR==FNR {$2=$2?$2:" "; u2a[$1]=$2; a2u[$2]=$1; next}
               {enc=$0; gsub(/....../,"& ",enc); n=split(enc,a); line="";
                for(i=1;i<=n;i++) line=line u2a[a[i]]; print line}' dict encoded_input

PUDÍN, ALMIDÓN
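
The decoder above cuts the input into six-character pieces, which fits \uXXXX but not hexadecimal references such as &#xcd;, whose length varies. One possible way to handle both is to let match() find each token instead of counting characters. This is only a sketch: the helper name escape_to_glyph and the tab-separated, glyph-first dictionary layout are assumptions, and tokens are looked up exactly as written, so &#xcd; and &#xCD; would need separate dictionary entries.

```shell
# Sketch only: escape_to_glyph is a hypothetical helper name; the dictionary
# layout (glyph<TAB>escape) and exact-case matching are assumptions.
escape_to_glyph() {
    # $1 = dictionary file, $2 = input file
    awk -F '\t' '
        NR == FNR { map[$2] = $1; next }  # first file: load escape -> glyph
        {
            rest = $0
            out = ""
            # a token is \u plus four hex digits, or &#x plus hex digits plus ;
            while (match(rest, /\\u[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]|&#x[0-9A-Fa-f]+;/)) {
                tok = substr(rest, RSTART, RLENGTH)
                out = out substr(rest, 1, RSTART - 1)
                out = out ((tok in map) ? map[tok] : tok)   # unknown tokens pass through
                rest = substr(rest, RSTART + RLENGTH)
            }
            print out rest
        }
    ' "$1" "$2"
}
```

Text between tokens is copied through untouched, so partly encoded lines are also accepted.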

Both of my scripts use your dict_4 as the dictionary (it is the file called dict in the commands).
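
As a minimal sketch of the combined version mentioned at the top, the two commands could be folded into one shell function that takes the direction as its first argument. The function name convert and the encode/decode argument are hypothetical. The sketch assumes a dict_4-style dictionary and, for encoding non-ASCII text, an awk that splits an empty FS per character (for example gawk in a UTF-8 locale). It also tests for an empty second field with == "" so that a glyph such as 0 is not mistaken for a missing one, and it passes unknown characters through instead of dropping them.

```shell
# Sketch only: convert and its encode/decode argument are hypothetical names.
# Assumes a dict_4-style dictionary (escape, whitespace, glyph) and, for
# encoding non-ASCII text, an awk that splits an empty FS per character
# (for example gawk in a UTF-8 locale).
convert() {
    # $1 = encode or decode, $2 = dictionary file, $3 = input file
    awk -v dir="$1" '
        NR == FNR {                           # first file: build both maps
            g = ($2 == "" ? " " : $2)         # empty second field means a space
            u2a[$1] = g
            a2u[g] = $1
            next
        }
        dir == "decode" {                     # six characters per \uXXXX token
            line = ""
            for (i = 1; i <= length($0); i += 6) {
                tok = substr($0, i, 6)
                line = line ((tok in u2a) ? u2a[tok] : tok)
            }
            print line
            next
        }
        {                                     # encode: one field per character
            line = ""
            for (i = 1; i <= NF; i++)
                line = line (($i in a2u) ? a2u[$i] : $i)
            print line
        }
    ' "$2" FS= "$3"
}
```

It would be called as convert encode dict input or convert decode dict encoded_input, reusing the file names from the commands above.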

Upvotes: 1
