Jason

Reputation: 2671

Using awk to detect UTF-8 multibyte character

I am using awk (symlinked to gawk on my machine) to read through a file and get a character count per line to test whether the file is fixed width. I can then re-use the following script with the -b (--characters-as-bytes) option to see whether the file is fixed width by byte.

#!/usr/bin/awk -f

BEGIN {
    width = -1;
}

{
    len = length($0);

    if (width == -1) {
        width = len;
    } else if (len != 0 && len != width) {
        exit 1;
    }
}
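
For example, I would invoke the two passes like this (fixedwidth.awk and data.txt are just placeholder names); gawk's -b (--characters-as-bytes) switch, or running under LC_ALL=C, makes length() count bytes instead of characters:

./fixedwidth.awk data.txt && echo "fixed width by character"

gawk -b -f fixedwidth.awk data.txt && echo "fixed width by byte"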

I want to do something similar to test whether each line in a file has the same number of bytes and characters, to assume all characters are a single byte (I do realize this is subject to false negatives). The challenge is that I would like to run through the file one time and break out at the first mismatch. Is there a way to set the -b option from within an awk script, similar to how you can adjust FS? If this isn't possible, I'm open to options outside of awk. I can always just write this in C if I have to, but I wanted to make sure there isn't something already available.

Efficiency is what I am aiming for here. Having this information will help me skip a costly process, so I don't want this check in itself to be costly. I'm dealing with files that can be over 100 million lines long.

Clarification

I want something like the script above. Something like this:

#!/usr/bin/awk -f
{
    if (length($0) != bytelength($0))
        exit 1;
}

I don't need any output; I will just trigger off the return code ($? in bash), so exit 1 if this fails. Obviously bytelength is not a real function. I'm just looking for a way to achieve this without running awk twice.
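
In other words, something like this hypothetical wrapper (singlebyte.awk is the check I'm asking for; the other names are placeholders):

#!/bin/bash
# skip the costly fixed-width processing when the check exits nonzero
if ./singlebyte.awk data.txt; then
    process_fixed_width data.txt    # the costly step (hypothetical)
else
    echo "possible multibyte characters; skipping" >&2
fi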

UPDATE

sundeep's solution works for what I have described above:

awk -F '' -l ordchr '{for(i=1;i<=NF;i++) if(ord($i)<0) {exit 1;}}'

I was operating under the assumption that awk would count a Windows single-byte character above 0x7F as a single character, but it actually doesn't count it at all, so byte length would still not equal length. I guess I will need to write this in C for something that specific.

Conclusion

So I think I did a poor job of explaining my problem. I receive data that is encoded either in UTF-8 or in a Windows-style single-byte encoding like CP1252. I wanted to check whether there are any multibyte characters in the file and exit if one is found. I originally wanted to do this in awk, but working with files that may have a different encoding has proven difficult.

So in a nutshell if we assume a file with a single character in it:

CHARACTER  FILE_ENCODING     ALL_SINGLE_BYTE   IN_HEX
á          UTF-8             false             0xC3 0xA1
á          CP1252            true              0xE1
a          ANY               true              0x61
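
For instance, those byte sequences can be reproduced from the shell (assuming iconv and xxd are available and the terminal is UTF-8):

printf 'á' | xxd -p                              # c3a1  (UTF-8, two bytes)
printf 'á' | iconv -f UTF-8 -t CP1252 | xxd -p   # e1    (CP1252, one byte)
printf 'a' | xxd -p                              # 61    (one byte in either encoding)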

Upvotes: 3

Views: 7727

Answers (6)

Jason
Jason

Reputation: 2671

Note : The code in this answer can be used to detect valid UTF-8 multi-byte characters. It will also fail if there are invalid UTF-8 byte sequences. However, it does not guarantee that your file is intended to be UTF-8. All valid UTF-8 code is also valid CP1252, but not all CP1252 is valid UTF-8.

So it seems this may be a bit of a niche problem. For me, that means time to resort to C. This should work, but, in the spirit of the question, I won't accept it in case someone can come up with an awk solution.

Here is my C solution I called hasmultibyte:

#include <stdio.h>
#include <stdlib.h>

void check_for_multibyte(FILE* in) 
{
        int c = 0;
        while ((c = getc(in)) != EOF) {
                /* Floating continuation byte */
                if ((c & 0xC0) == 0x80)
                        exit(5);

                /* utf8 multi-byte start */
                if ((c & 0xC0) == 0xC0) {
                        int continuations = 1;
                        switch (c & 0xF0) {
                        case 0xF0:
                                continuations = 3;
                                break;
                        case 0xE0:
                                continuations = 2;
                        }   
                        int i = 0;
                        for (; i < continuations; ++i)
                                if ((getc(in) & 0xC0) != 0x80)
                                        exit(5);

                        exit(0);
                }   
        }   
}

int main (int argc, char** argv)
{
        FILE* in = stdin;
        int i = 1;
        do {
                if (i != argc) {
                        in = fopen(argv[i], "r");
                        if (!in) {
                                perror(argv[i]);
                                exit(EXIT_FAILURE);
                        }   
                }   

                check_for_multibyte(in);

                if (in != stdin)
                        fclose(in);
        } while (++i < argc);

        return 5;
}
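
To build it (assuming the source is saved as hasmultibyte.c and gcc is available):

gcc -Wall -o hasmultibyte hasmultibyte.c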

In the shell environment, you could then use it like this:

if hasmultibyte file.txt; then
    ...
fi

It will also read from stdin if no file is provided, in case you want to use it at the end of a pipeline:

if cat file.txt | hasmultibyte; then
    ...
fi

TEST

Here is a test of the program. I created 3 files with the name Hernández in them:

name_ascii.txt  - Uses a instead of á.
name_cp1252.txt - Encoded in CP1252
name_utf-8.txt  - Encoded in UTF-8 (default)

The � you see below is due to the terminal expecting UTF-8 and receiving an invalid sequence. It is, in fact, the character á encoded in CP1252.

> file name_*
name_ascii.txt:  ASCII text
name_cp1252.txt: ISO-8859 text
name_utf-8.txt:  UTF-8 Unicode text
> cat name_*
Hernandez
Hern�ndez
Hernández
> hasmultibyte name_ascii.txt && echo multibyte
> hasmultibyte name_cp1252.txt && echo multibyte
> hasmultibyte name_utf-8.txt && echo multibyte
multibyte

Update

This code has been updated from the original. It has been changed to read the first byte of a multibyte character and determine from it how many bytes the character should be. This can be determined as follows:

first byte    number of bytes
110xxxxx      2
1110xxxx      3
11110xxx      4

This method is more reliable and will reduce inaccuracies. The original method searched for a byte of the form 11xxxxxx and checked the next byte for a continuation byte (10xxxxxx). That will produce a false positive given something like â„x in a CP1252 file. In binary, this is 11100010 10000100 01111000. The first byte claims a character of 3 bytes, the second is a continuation byte, but the third is not. This is not a valid UTF-8 sequence.

Additional testing

> # create files
> echo "â„¢" | iconv -f UTF-8 -t CP1252 > 3byte.txt
> echo "Ââ„¢" | iconv -f UTF-8 -t CP1252 > 3byte_fail.txt
> echo "â„x" | iconv -f UTF-8 -t CP1252 > 3byte_fail2.txt

> hasmultibyte 3byte.txt; echo $? 
0
> hasmultibyte 3byte_fail.txt; echo $? 
5
> hasmultibyte 3byte_fail2.txt; echo $? 
5

Upvotes: 1

RARE Kpop Manifesto

Reputation: 2841

== update = 9-20-21 ========

So it turns out even the pre-slicing isn't necessary at all.

gawk -e 'BEGIN { ORS = ":";

   a0 = a = "\354\236\274"; 
   n = 1;                  # this # is for how many bytes 
                           # you'd like to see                
   b1 = b = \
       sprintf("%.*s",n + 1,a = "\301" a); 

   sub("^"b,   "", a) 
   sub(/^\301/,"", b) 
   sub("\236|\270|\271|\272|\273|\274|\275|\276|\277",":&", a)

       # for that string, 
       # chain up every byte in \x80-\xBF range, 
       # but make sure not to tag on "( )" at the 2 ends.
       # that will make the regex a lot slower,
       # for reasons unclear to me 

   printf(":" a0 "|" b1 "|"  b ORS a  "|") } ' | odview

yielding this output

     :  잼  **  **    | 301 354  | 354   :  236   : 274  |        
   072 354 236 274 174 301 354 174 354 072  236 072 274 174        
     :   ?  9e   ?   |   ?   ?  |   ?    :  9e   :   ?   |        
     58 236 158 188 124 193 236 124 236 58  158  58 188 124        
     3a  ec  9e  bc  7c  c1  ec 7c  ec  3a   9e  3a  bc  7c  

voila ~~ using only sprintf() and [g]sub(), every individual byte is at your fingertips, even in Unicode mode, without needing to use arrays at all.

===========================

Since we're on the topic of awk and UTF-8, a quick tip to share (only on the multi-byte part):

If you're in gawk's Unicode-aware mode and want to access individual bytes of just a few UTF-8 characters (e.g. to perform URL encoding, analyze them individually, or pack a DWORD32), but don't want to use the cost-heavy approach of gsub(//,"&"SUBSEP) followed by splitting into an array, a quick-and-dirty method is just

   gsub(/\302|\303|\304|\305|\306|\307|\310|\311|\312|\313|\314|\315|\316|\317|\320|\321|\322|\323|\324|\325|\326|\327|\330|\331|\332|\333|\334|\335|\336|\337|\340|\341|\342|\343|\344|\345|\346|\347|\350|\351|\352|\353|\354|\355|\356|\357|\360|\361|\362|\363|\364/, "&\300")



  잼  **  **    =    354 *300*<---236 274                                 
  354 236 274  075  354 300    236 274                                
   ?  9e   ?    =    ?   ?      9e   ?                                
  236 158 188  61   236 192    158 188                                
   ec  9e  bc  3d    ec  c0     9e  bc  
                          

Basically, "slicing" properly encoded UTF8 characters right between the leading byte and the trailing ones. In my personal trial-and-error, i find the 13 bytes illegal within UTF8 (xC0 xC1 xF5-xFF) to be best suited for this task.

Say the original variable is called b3. Then use

b2 = sprintf("%.3s",b3) to extract out \354 \300 \236.

sub(b2,"",b3) so now b3 will only have \274.

b1 = sprintf("%.1s", b2) so b1 will now be just \354

sub(b1"\300","",b2) and finally, b2 will be just the 2nd byte, \236

The reason for this painfully tedious process is that one gsub doubling every byte, followed by a full array split() plus 3 more array-entry lookups, can be slightly slow. If you want to count bytes first,

   lenBytes = match($0, /$/) - 1; 
                               
    # i only recently discovered 
    # this trick that works decently well

That match() call even works for a random collection of bytes that bear no resemblance to Unicode, and gawk is very happy to give you the exact result. That's the only meaningful way to run match( ) against random bytes and not get an error message from gawk (the other being match($0,/^/), but that's quite useless; try doing .* or . or .+ and all will end up erroring about a bad character in the locale).

** Don't use index( ). If you need exact positions, then just split into an array.

And if you need to do byte-level substring

Don't directly use substr() for random bytes in gawk unicode-mode.

Use sprintf("%.53s",b3) instead. Before slicing, that syntax gives you 53 unicode characters. After slicing, it's 53 bytes from start of string.

I even chain them up myself as if they were gensub(), even though it's good ol' sub():

if (sub(reANY340357,"&\301",z)||3==b) {
    sub((x=sprintf("%.1s",(y=sprintf("%.3s",z))sub(y,"",z)))"\301","",y)

And once you're done with everything you need, a quick gsub(/\300|\301/, "") will restore the proper UTF-8 string.

Hope this is useful =)

Upvotes: 1

RARE Kpop Manifesto

Reputation: 2841

For non-Unicode-aware versions of awk (gawk -b, any awk under LC_ALL=C, or mawk/mawk2):

awk 'BEGIN {

   reUTF8="([\\000-\\177]|" \
          "[\\302-\\337][\\200-\\277]|" \
          "\\340[\\240-\\277][\\200-\\277]|" \
          "\\355[\\200-\\237][\\200-\\277]|" \
          "[\\341-\\354\\356-\\357][\\200-\\277]" \
          "[\\200-\\277]|\\360[\\220-\\277]" \
          "[\\200-\\277][\\200-\\277]|" \
          "[\\361-\\363][\\200-\\277][\\200-\\277]" \
          "[\\200-\\277]|\\364[\\200-\\217]" \
          "[\\200-\\277][\\200-\\277])" }'

Set this regex and you should be able to get a total UTF-8-compliant character count matching gnu-wc -lcm, even for purely binary files like mp3s, mp4s, or compressed gz/xz/zip archives. As long as your data itself is UTF-8-compliant, this will count it, as specified in Unicode 13.

Your locale settings don't matter here whatsoever, nor do your platform, OS version, awk version, or awk variant.
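
As a self-contained usage sketch (this is my own wrapper around that regex, not the lengthB0()/lengthC0() helpers used in the benchmarks below; a byte-mode awk is assumed, e.g. gawk -b, mawk, or any awk under LC_ALL=C):

LC_ALL=C awk '
BEGIN {
   reUTF8="([\\000-\\177]|" \
          "[\\302-\\337][\\200-\\277]|" \
          "\\340[\\240-\\277][\\200-\\277]|" \
          "\\355[\\200-\\237][\\200-\\277]|" \
          "[\\341-\\354\\356-\\357][\\200-\\277][\\200-\\277]|" \
          "\\360[\\220-\\277][\\200-\\277][\\200-\\277]|" \
          "[\\361-\\363][\\200-\\277][\\200-\\277][\\200-\\277]|" \
          "\\364[\\200-\\217][\\200-\\277][\\200-\\277])"
}
{
   bytes += length($0)                 # length() is a byte count in byte mode
   s = $0
   while (match(s, reUTF8)) {          # consume one UTF-8-compliant character
       chars++                         # (invalid bytes are simply skipped over)
       s = substr(s, RSTART + RLENGTH)
   }
}
END { print NR, chars, bytes }         # newlines not included, unlike wc -lcm
' file.txt

The repeated substr() makes this quadratic on long lines, so it only illustrates how the regex is applied; it is not meant to compete with the benchmark numbers below.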

$ echo; time pvE0 < MV84/*BLITZE*webm | gwc -lcm

      in0:  449MiB 0:00:10 [44.4MiB/s] [44.4MiB/s] [================================================>] 100%            
1827289 250914815 471643928

real    0m10.188s
user    0m10.075s
sys 0m0.352s
$ echo; time pvE0 < MV84/*BLITZE*webm | mawk2x 'BEGIN { FS = "^$"} { bytes += lengthB0(); chars += lengthC0(); } END { print --NR, chars+NR, bytes+NR }'

      in0:  449MiB 0:00:16 [27.0MiB/s] [27.0MiB/s] [================================================>] 100%            
1827289=250914815=471643928

real    0m16.756s
user    0m16.621s
sys 0m0.449s

The file being tested is a 449 MB .webm music video clip from YouTube (3840x2160, VP9 + Opus codecs). Not too shabby for an interpreted scripting language to be this close to compiled C binaries.

And it's only this slow because of the hideously long regex needed to account for invalid bytes. If you're extremely sure your data is fully UTF-8-compliant text, you can further optimize that regex so that mawk2 goes faster than both gnu-wc and bsd-wc:

$  brc; time pvE0 < "${m3t}" | awkwc4m
      in0: 1.85GiB 0:00:14 [ 128MiB/s] [ 128MiB/s] [================================================>] 100%            
  12,494,275 lines     1,285,316,715 utf8 (349,725,658 uc)     1,891.656 MB (  1983544693)  /dev/stdin

real    0m14.753s <—- Custom Bash function that's entirely AWK

$  brc; time pvE0 < "${m3t}" |gwc -lcm
      in0: 1.85GiB 0:00:28 [67.3MiB/s] [67.3MiB/s] [================================================>] 100%            
12494275 1285316715 1983544693

real    0m28.165s <—— GNU WC

$  brc; time pvE0 < "${m3t}" |wc -lcm
      in0: 1.85GiB 0:00:22 [85.5MiB/s] [85.5MiB/s] [================================================>] 100%            
 12494275 1285316715

real    0m22.181s  <——  BSD WC

ps : "${m3t}" is a 1.85GB flat .txt file that's 12.5 million rows, and 13 fields each, filled to the brim with multibyte unicode characters (349.7 million of them).

gawk -e (in Unicode mode) will complain about that regex. To circumvent that annoyance, use this regex, which is the same as the one above but expanded out to make gawk -e happy:

  1. ([\000-\177]|((\302|\303|\304|\305|\306|\307|\310|\311|\312|\313|\314|\315|\316|\317|\320|\321|\322|\323|\324|\325|\326|\327|\330|\331|\332|\333|\334|\335|\336|\337)|(\340)(\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|(\355)(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237))(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|((\341|\342|\343|\344|\345|\346|\347|\350|\351|\352|\353|\354|\356|\357)|(\360)(\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|(\361|\362|\363)(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277)|(\364)(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217))(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277){2})

Upvotes: 1

RARE Kpop Manifesto

Reputation: 2841

Quote from the same Wikipedia page mentioned above:

Fallback and auto-detection: Only a small subset of possible byte strings are a valid UTF-8 string: the bytes C0, C1, and F5 through FF cannot appear, and bytes with the high bit set must be in pairs, and other requirements.

In octal, that means xC0 = \300, xC1 = \301, and xF5-xFF = \365-\377 are the bytes that are never valid UTF-8.

Knowing that this space isn't valid UTF-8 gives you plenty of wiggle room to insert custom delimiters inside any string:

Pick any of those bytes, say \373. Once a quick if statement has verified it doesn't already exist on that line, you can perform custom text-manipulation tricks of your choice with a single-byte delimiter, even if it involves inserting it right in between the UTF-8 bytes of a single code point, and it won't ruin the Unicode at all. Once you're done with the logic block, simply use a quick gsub( ) to remove all traces of it.

If that byte (\373, i.e. \xFB) does exist, well, you're likely dealing with either a binary file or partially corrupted UTF-8 text data.

One use case, such as in my own modules, is a UTF-8 code-point-level-safe* substr( ) function. Instead of manually counting out the code points one at a time, first use a regex to find the maximum byte count of any code point; let's say 3 bytes (since 4-byte ones are still rare in practice).

Then I apply 1 pad of \373 next to the 2-byte ones (I pad it to the left of [\302-\337]), and 2 pads of it, i.e. \373\373, next to the ASCII ones, and voila: now all UTF-8 code points have a fixed width, so substr( ) becomes a mere multiplication exercise.

Run a byte-level substr( ) on those start and end points, apply gsub(/[\373]+/, "", s) to throw away all the padding bytes, and now you have a usable* UTF-8-safe substr( ) function for all the variants of awk that aren't Unicode-aware. This approach also works for multi-line records, and absolutely does not affect how FS and RS interact with the record.

(If you need 4-byte code points, just pad more.)

*I haven't incorporated any fancy logic to account for code points that are post-decomposition components supposedly grouped together as a single logical unit for string-manipulation purposes.
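
A minimal sketch of that padded substr( ), limited to code points of up to 3 bytes and written for a byte-mode awk (gawk -b, mawk, or any awk under LC_ALL=C); the function name is mine:

LC_ALL=C gawk '
function substr_u(s, start, n,    t) {
    if (index(s, "\373")) return ""            # padding byte already present: bail out
    t = s
    gsub(/[\302-\337]/, "\373&", t)            # 2-byte lead bytes get one pad
    gsub(/[\001-\177]/, "\373\373&", t)        # ASCII bytes get two pads
                                               # (3-byte lead bytes \340-\357 get none)
    t = substr(t, (start - 1) * 3 + 1, n * 3)  # every code point is now 3 bytes wide
    gsub(/\373+/, "", t)                       # throw away the padding
    return t
}
BEGIN { print substr_u("aé€b", 2, 2) }         # prints "é€"
'

The index( ) guard and the final gsub( ) correspond to the \373 checks described above; the pad width would go to 4 for data containing 4-byte code points.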

Upvotes: 1

KamilCuk

Reputation: 141165

You seem to be targeting UTF-8 specifically. Indeed, the first byte of a multibyte character in UTF-8 starts with 0b11xxxxxx and each following byte has the form 0b10xxxxxx, where x represents any value (from Wikipedia).

So you can detect such sequence with sed by matching the hex ranges and exit with nonzero exit status if found:

LC_ALL=C sed -n '/[\xC0-\xFF][\x80-\xBF]/q1'

I.e. it matches two consecutive bytes in the ranges [0b11000000-0b11111111][0b10000000-0b10111111].

I think \x?? and q with an exit status are both GNU extensions to sed.
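
A usage sketch, with the file name as a placeholder:

if LC_ALL=C sed -n '/[\xC0-\xFF][\x80-\xBF]/q1' file.txt; then
    echo "no multibyte sequence found"
else
    echo "multibyte sequence found"    # q1 made sed exit with status 1
fi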

Upvotes: 5

Enlico

Reputation: 28416

The best answer is, imho, actually the one with grep provided by Sundeep in the comments. You should try to get that working. The answer below makes use of sed in a similar way; I will probably delete it, as it really doesn't add anything to the grep solution.

What about this?

[[ -z "$(LANG=C sed -z '/[\x80-\xFF]/d' <(echo -e 'one\ntwo\nth⌫ree'))" ]]
echo $?
  • <(echo -e 'one\ntwo\nth⌫ree') is just an example file with a multibyte character in it
  • the whole sed command does one of two things:
    • outputs the empty string if the file contains a multibyte character
    • outputs the full file if it doesn't
  • the [[ -z string ]] test returns 0 if the string has length zero, 1 otherwise.
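
If you only want the exit status (as in the question), the same idea can be wrapped like this; note that, unlike the sed q1 answer above, it reads and captures the whole file even when a high byte appears early, which matters for very large files (the function name is mine):

all_single_byte() {
    # returns 0 (true) when the file contains no byte in the \x80-\xFF range;
    # caveat: an empty file also reports a failure here
    [[ -n "$(LANG=C sed -z '/[\x80-\xFF]/d' "$1")" ]]
}

all_single_byte file.txt || echo "high byte found"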

Upvotes: 1
