Reputation: 1726
I need to create an awk script to convert a glyph (https://en.wikipedia.org/wiki/Glyph) to Unicode (JavaScript syntax), and the reverse: Unicode to a glyph.
The source data is stored in Notepad++ with UTF-8 encoding.
Here's my progress.
Use_case_1
Dictionary file (dict_1_.txt):
A \u0041
À \u00C0
Input file (input_1_.txt):
A
À
awk script for generating the Unicode equivalent of each glyph:
awk 'NR == FNR { a[$1] = $2; next } $1 in a { $1 = a[$1] } $2 in a { $2 = a[$2] } 1' dict_1_.txt input_1_.txt
correctly producing:
\u0041
\u00C0
Use_case_2
Dictionary file (dict_2_.txt)
\u0041 A
\u00C0 À
Input file (input_2_.txt)
\u0041
\u00C0
awk script for generating the glyph equivalent of each Unicode escape:
awk 'NR == FNR { a[$1] = $2; next } $1 in a { $1 = a[$1] } $2 in a { $2 = a[$2] } 1' dict_2_.txt input_2_.txt
correctly producing:
A
À
So I can successfully "round-trip" a single symbol.
But how do I deal with a more comprehensive dictionary and more than one word per row?
Here is sample data.
Input file (input_3_.txt)
PUDÍN, ALMIDÓN
Dictionary file (dict_3_.txt)
, \u002C
A \u0041
D \u0044
I \u0049
Í \u00CD
L \u004C
M \u004D
N \u006E
Ó \u00D3
P \u0050
U \u0055
<space> \u0020
The awk script should generate:
\u0050\u0055\u0044\u00CD\u006E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u006E
Input file (input_4_.txt)
\u0050\u0055\u0044\u00CD\u006E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u006E
Dictionary file (dict_4_.txt)
\u002C ,
\u0041 A
\u0044 D
\u0049 I
\u00CD Í
\u004C L
\u004D M
\u006E N
\u00D3 Ó
\u0050 P
\u0055 U
\u0020 <space>
The awk script should generate:
PUDÍN, ALMIDÓN
Here is a more complicated set of input strings (one per row):
MONO Y DIACETIL ÉSTERES DEL ÁCIDO TARTÁRICO DE MONO Y DIGLICÉRIDOS DE ÁCIDOS GRASOS AÑADIDOS
043 HUEVAS DE PESCADO (INCLUYENDO ESPERMA=HUEVAS BLANDAS) Y VÍSCERAS COMESTIBLES DE PESCADO
ACEITE DE SOJA OXIDADO TÉRMICAMENTE Y EN INTERACCIÓN CON MONO Y DIGLICÉRIDOS DE ÁCIDOS GRASOS
BANDEJA PLÁSTICA O CAZUELA, CUBIERTA DE PAPEL DE ALUMINIO O ENVOLTURA
In the Dictionary examples above, I have used <space> to indicate the 'symbol' between words and after a comma. This probably means that a solution should use \t for FS in both the Dictionary file and the Input file. Currently the FS is a keyboard 'space'. Also, the RS is \n.
Further, I need to do the same for hexadecimal, so a solution needs to process a Dictionary file like this:
Í &#xCD;
Ó &#xD3;
as compared to the Dictionary example above:
Í \u00CD
Ó \u00D3
How can I improve or replace my simple awk scripts with scripts that process the longer strings on multiple lines?
Upvotes: 0
Views: 587
Reputation: 67467
Here is one approach. Note that you don't need two different versions of the dictionary.
With little effort, the two scripts below can be combined into one, with the conversion direction controlled by a parameter. I intentionally kept the dictionary part the same in both.
$ awk 'NR==FNR {$2=($2==""?" ":$2)   # an empty glyph field stands for a space
                u2a[$1]=$2; a2u[$2]=$1; next}
       {for(i=1;i<=NF;i++) $i=a2u[$i]}1' dict FS='' OFS='' input
\u0050\u0055\u0044\u00CD\u006E\u002C\u0020\u0041\u004C\u004D\u0049\u0044\u00D3\u006E
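The script above relies on the command-line assignment FS='' placed between the two file names: the dictionary is still split on whitespace, while each character of the input becomes its own field. Splitting on an empty FS is not defined by POSIX, although gawk and mawk support it, and multibyte glyphs such as Í are only kept whole when awk runs in a UTF-8 aware locale. A minimal, ASCII-only sketch of the splitting behavior (the variable name fields is just for illustration):

```shell
#!/bin/sh
# With an empty FS, every character of the record becomes a separate field.
# ASCII-only input, so the result does not depend on the locale.
fields=$(printf 'A,B\n' |
    awk 'BEGIN { FS = "" } { for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }')
printf '%s\n' "$fields"
```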
Working with the encoded input now:
$ awk 'NR==FNR {$2=($2==""?" ":$2); u2a[$1]=$2; a2u[$2]=$1; next}
{enc=$0; gsub(/....../,"& ",enc); n=split(enc,a); line="";   # reset line for each record
for(i=1;i<=n;i++) line=line u2a[a[i]]; print line}' dict encoded_input
PUDÍN, ALMIDÓN
Both scripts use your dict_4 as the dictionary.
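As a rough sketch of the single, parameter-driven script mentioned above, the following shell wrapper takes the direction as an argument and converts every line of the input. Instead of splitting on an empty FS, it matches the longest dictionary key at each position, which should also work in awk implementations that treat UTF-8 text as bytes. The function name convert, the direction values enc and dec, the file names, and the handling of the literal <space> placeholder are all assumptions made for illustration. The dictionary follows the dict_4 layout, except that N is mapped to \u004E (the question lists \u006E, which is lowercase n).

```shell
#!/bin/sh
# Sketch only: convert(), the enc/dec values and the file names are hypothetical.
tmp=$(mktemp -d)

# Dictionary in the dict_4 layout: "<code> <glyph>".
cat > "$tmp/dict.txt" <<'EOF'
\u002C ,
\u0041 A
\u0044 D
\u0049 I
\u00CD Í
\u004C L
\u004D M
\u004E N
\u00D3 Ó
\u0050 P
\u0055 U
\u0020 <space>
EOF

printf '%s\n' 'PUDÍN, ALMIDÓN' 'MIL' > "$tmp/input.txt"

# usage: convert enc|dec DICT_FILE INPUT_FILE
convert() {
    awk -v dir="$1" '
    NR == FNR {                         # first file: build the lookup table
        code  = $1
        glyph = ($2 == "" || $2 == "<space>") ? " " : $2
        if (dir == "enc") { from = glyph; to = code }
        else              { from = code;  to = glyph }
        map[from] = to
        if (length(from) > maxlen) maxlen = length(from)
        next
    }
    {                                   # second file: convert each line
        out = ""
        i = 1
        n = length($0)
        while (i <= n) {
            # try the longest possible key first, then shorter ones
            for (len = maxlen; len > 0; len--) {
                tok = substr($0, i, len)
                if (tok in map) break
            }
            if (len == 0) {             # no key matches: copy one unit through
                out = out substr($0, i, 1)
                i++
            } else {
                out = out map[tok]
                i += length(tok)
            }
        }
        print out
    }' "$2" "$3"
}

encoded=$(convert enc "$tmp/dict.txt" "$tmp/input.txt")
printf '%s\n' "$encoded" > "$tmp/encoded.txt"
decoded=$(convert dec "$tmp/dict.txt" "$tmp/encoded.txt")
printf '%s\n' "$encoded" "$decoded"
```

Because tokens are found by looking up dictionary keys rather than by assuming six-character codes, a dictionary that uses another code notation (such as the hexadecimal form mentioned in the question) should work with the same function, provided no code is a prefix of another. Characters missing from the dictionary are copied through unchanged.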
Upvotes: 1