Reputation: 2671
I am using awk (symlinked to gawk on my machine) to read through a file and get a character count per line, to test whether the file is fixed width. I can then re-use the following script with the -b (--characters-as-bytes) option to see if the file is fixed width by byte.
#!/usr/bin/awk -f
BEGIN {
    width = -1;
}
{
    len = length($0);
    if (width == -1) {
        width = len;
    } else if (len != 0 && len != width) {
        exit 1;
    }
}
I want to do something similar to test whether each line in a file has the same number of bytes and characters, so that I can assume all characters are a single byte (I do realize this is subject to false negatives). The challenge is that I would like to run through the file one time and break out at the first mismatch. Is there a way to set the -b option from within an awk script, similar to how you can adjust FS? If this isn't possible, I'm open to options outside of awk. I can always just write this in C if I have to, but I wanted to make sure there isn't something already available.
Efficiency is what I am aiming for here. Having this information will help me skip a costly process, so I don't want this in itself to be costly. I'm dealing with files that can be over 100 million lines long.
Clarification
I want something like the above. Something like this
#!/usr/bin/awk -f
{
    if (length($0) != bytelength($0))
        exit 1;
}
I don't need any output. I will just trigger off the return code ($? in bash). So exit 1 if this fails. Obviously bytelength is not a function. I'm just looking for a way to achieve this without running awk twice.
UPDATE
sundeep's solution works for what I have described above:
awk -F '' -l ordchr '{for(i=1;i<=NF;i++) if(ord($i)<0) {exit 1;}}'
I was operating under the assumption that awk would count a character above 0x7F in a Windows single-byte encoding as a single character, but it actually doesn't count it at all. So byte length would still not be the same as length. I guess I will need to write this in C for something that specific.
Conclusion
So I think I did a poor job of explaining my problem. I receive data that is encoded in either UTF-8 or a Windows-style single-byte encoding like CP1252. I wanted to check if there are any multibyte characters in the file and exit if one is found. I originally wanted to do this in awk, but playing with files that may have a different encoding has proven difficult.
So in a nutshell if we assume a file with a single character in it:
CHARACTER  FILE_ENCODING  ALL_SINGLE_BYTE  IN_HEX
á          UTF-8          false            0xC3 0xA1
á          CP1252         true             0xE1
a          ANY            true             0x61
Upvotes: 3
Views: 7727
Reputation: 2671
Note : The code in this answer can be used to detect valid UTF-8 multi-byte characters. It will also fail if there are invalid UTF-8 byte sequences. However, it does not guarantee that your file is intended to be UTF-8. All valid UTF-8 code is also valid CP1252, but not all CP1252 is valid UTF-8.
So it seems this may be a bit of a niche problem. For me, that means it's time to resort to C. This should work, but, in the spirit of the question, I won't accept it in case someone can come up with an awk solution.
Here is my C solution, which I called hasmultibyte:
#include <stdio.h>
#include <stdlib.h>

/*
 * Exits with 0 as soon as one complete multi-byte sequence is read and
 * with 5 on a malformed sequence. Returns normally if the stream holds
 * neither.
 */
void check_for_multibyte(FILE* in)
{
    int c = 0;
    while ((c = getc(in)) != EOF) {
        /* Floating continuation byte */
        if ((c & 0xC0) == 0x80)
            exit(5);
        /* utf8 multi-byte start */
        if ((c & 0xC0) == 0xC0) {
            /* 110xxxxx: one continuation byte unless a case below matches */
            int continuations = 1;
            switch (c & 0xF0) {
            case 0xF0:
                continuations = 3;
                break;
            case 0xE0:
                continuations = 2;
            }
            int i = 0;
            for (; i < continuations; ++i)
                if ((getc(in) & 0xC0) != 0x80)
                    exit(5);
            exit(0);
        }
    }
}

int main (int argc, char** argv)
{
    FILE* in = stdin;
    int i = 1;
    /* With no file arguments the loop body runs once, on stdin */
    do {
        if (i != argc) {
            in = fopen(argv[i], "r");
            if (!in) {
                perror(argv[i]);
                exit(EXIT_FAILURE);
            }
        }
        check_for_multibyte(in);
        if (in != stdin)
            fclose(in);
    } while (++i < argc);
    /* No multi-byte character was found in any input */
    return 5;
}
In the shell environment, you could then use it like this:
if hasmultibyte file.txt; then
...
fi
It will also read from stdin if no file is provided, in case you want to use it at the end of a pipeline:
if cat file.txt | hasmultibyte; then
...
fi
TEST
Here is a test of the program. I created 3 files with the name Hernández in it:
name_ascii.txt - Uses a instead of á.
name_cp1252.txt - Encoded in CP1252
name_utf-8.txt - Encoded in UTF-8 (default)
The � you see appears because the terminal expects UTF-8 and the byte is not valid UTF-8. It is, in fact, the character á in CP1252.
> file name_*
name_ascii.txt: ASCII text
name_cp1252.txt: ISO-8859 text
name_utf-8.txt: UTF-8 Unicode text
> cat name_*
Hernandez
Hern�ndez
Hernández
> hasmultibyte name_ascii.txt && echo multibyte
> hasmultibyte name_cp1252.txt && echo multibyte
> hasmultibyte name_utf-8.txt && echo multibyte
multibyte
Update
This code has been updated from the original. It now reads the first byte of a multibyte character and works out from it how many bytes the character should have. This can be determined as follows.
first byte   number of bytes
110xxxxx     2
1110xxxx     3
11110xxx     4
This method is more reliable and will reduce inaccuracies. The original method searched for a byte of the form 11xxxxxx and checked whether the next byte was a continuation byte (10xxxxxx). That produces a false positive given something like â„x in a CP1252 file. In binary, this is 11100010 10000100 01111000. The first byte claims a character of 3 bytes, the second is a continuation byte, but the third is not. This is not a valid UTF-8 sequence.
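The lead-byte table above can be sketched as a small helper. This is only an illustration and not the code in hasmultibyte itself: the names utf8_expected_length and starts_with_complete_sequence are hypothetical, and the check looks only at the lead and continuation bit patterns (it does not reject overlong encodings or surrogates).

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes a UTF-8 character should occupy, judged from its first
 * byte alone. Returns 0 for a byte that cannot start a character
 * (a continuation byte 10xxxxxx, or 11111xxx). */
int utf8_expected_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Returns 1 if buf begins with a lead byte followed by the number of
 * continuation bytes (10xxxxxx) that the lead byte announces, else 0. */
int starts_with_complete_sequence(const unsigned char *buf, size_t n)
{
    assert(buf != NULL);
    if (n == 0)
        return 0;
    int len = utf8_expected_length(buf[0]);
    if (len == 0 || (size_t)len > n)
        return 0;
    for (int i = 1; i < len; ++i)
        if ((buf[i] & 0xC0) != 0x80)
            return 0;
    return 1;
}
```

With the bytes from the example, E2 84 78 is rejected because the third byte is not a continuation byte, while E2 84 A2 is accepted as a complete 3-byte sequence.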
Additional testing
> # create files
> echo "â„¢" | iconv -f UTF-8 -t CP1252 > 3byte.txt
> echo "Ââ„¢" | iconv -f UTF-8 -t CP1252 > 3byte_fail.txt
> echo "â„x" | iconv -f UTF-8 -t CP1252 > 3byte_fail2.txt
> hasmultibyte 3byte.txt; echo $?
0
> hasmultibyte 3byte_fail.txt; echo $?
5
> hasmultibyte 3byte_fail2.txt; echo $?
5
Upvotes: 1
Reputation: 2841
== update = 9-20-21 ========
so turns out even the pre-slicing isn't necessary at all.
gawk -e 'BEGIN { ORS = ":";
a0 = a = "\354\236\274";
n = 1; # this # is for how many bytes
# you'd like to see
b1 = b = \
sprintf("%.*s",n + 1,a = "\301" a);
sub("^"b, "", a)
sub(/^\301/,"", b)
sub("\236|\270|\271|\272|\273|\274|\275|\276|\277",":&", a)
# for that string,
# chain up every byte in \x80-\xBF range,
# but make sure not to tag on "( )" at the 2 ends.
# that will make the regex a lot slower,
# for reasons unclear to me
printf(":" a0 "|" b1 "|" b ORS a "|") } ' | odview
yielding this output
: 잼 ** ** | 301 354 | 354 : 236 : 274 |
072 354 236 274 174 301 354 174 354 072 236 072 274 174
: ? 9e ? | ? ? | ? : 9e : ? |
58 236 158 188 124 193 236 124 236 58 158 58 188 124
3a ec 9e bc 7c c1 ec 7c ec 3a 9e 3a bc 7c
voila ~~ using only sprintf() and [g]sub(), every individual byte is at ur fingertip, even when in unicode code, without needing to use arrays at all.
===========================
since we're on the topic of awk and UTF8, a quick tip share (only on the multi-byte part):
if you're in gawk unicode-aware mode, and wanna access individual bytes of just a few utf8 chars (e.g. performing URL encoding, analyzing them individually, or like packing a DWORD32), but don't wanna use the cost-heavy approach of gsub(//,"&"SUBSEP) then splitting into an array, a quick-n-dirty method is just
gsub(/\302|\303|\304|\305|\306|\307|\310|\311|\312\
|\313|\314|\315|\316|\317|\320|\321|\322|\323|\324\
|\325|\326|\327|\330|\331|\332|\333|\334|\335|\336\
|\337|\340|\341|\342|\343|\344|\345|\346|\347|\350\
|\351|\352|\353|\354|\355|\356|\357|\360|\361|\362\
|\363|\364/, "&\300")
잼 ** ** = 354 *300*<---236 274
354 236 274 075 354 300 236 274
? 9e ? = ? ? 9e ?
236 158 188 61 236 192 158 188
ec 9e bc 3d ec c0 9e bc
Basically, "slicing" properly encoded UTF8 characters right between the leading byte and the trailing ones. In my personal trial-and-error, i find the 13 bytes illegal within UTF8 (xC0 xC1 xF5-xFF) to be best suited for this task.
say original var is called b3. then use
b2 = sprintf("%.3s",b3)
to extract out \354 \300 \236.
sub(b2,"",b3)
so now b3 will only have \274.
b1 = sprintf("%.1s", b2)
b1 will now be just \354
sub(b1"\300","",b2)
and finally, b2 will actually just be the 2nd byte of \236
The reason for this painfully tedious process is that 1 gsub doubling every byte, then another full array split(), plus 3 more array entry lookups can be slightly slow. If you wanna count bytes first,
lenBytes = match($0, /$/) - 1; # i only recently discovered # this trick that works decently well
that match one even works for a random collection of bytes that have no resemblance to Unicode, and gawk is very happy to give you the exact result. That's the only meaningful way to run match( ) against random bytes and not get an error message from gawk (the other being match($0,/^/), but that's quite useless). try doing .* / . / .+ and all will end up erroring about a bad character in locale.
** don't use index( ). if you need exact positions, then just split into array.
And if you need to do byte-level substring: don't directly use substr() for random bytes in gawk unicode-mode. Use sprintf("%.53s",b3) instead. Before slicing, that syntax gives you 53 unicode characters. After slicing, it's 53 bytes from start of string.
i even chain them up myself as if they're gensub() even though it's good ole' sub() :
if (sub(reANY340357,"&\301",z)||3==b) { sub((x=sprintf("%.1s",(y=sprintf("%.3s",z))sub(y,"",z)))"\301","",y)
And once you're done with everything you need, a quick gsub(/\300|\301/, "") will restore you the proper UTF8 string.
Hope this is useful =)
Upvotes: 1
Reputation: 2841
for non-unicode aware versions of awk,
gawk -b/ LC_ALL=C /mawk/mawk2 'BEGIN {
reUTF8="([\\000-\\177]|" \
"[\\302-\\337][\\200-\\277]|" \
"\\340[\\240-\\277][\\200-\\277]|" \
"\\355[\\200-\\237][\\200-\\277]|" \
"[\\341-\\354\\356-\\357][\\200-\\277]" \
"[\\200-\\277]|\\360[\\220-\\277]" \
"[\\200-\\277][\\200-\\277]|" \
"[\\361-\\363][\\200-\\277][\\200-\\277]" \
"[\\200-\\277]|\\364[\\200-\\217]" \
"[\\200-\\277][\\200-\\277])" }'
Set this regex. You should be able to get the total UTF8-compliant character count as counted by gnu-wc -lcm, even for purely binary files like mp3s or mp4s or compressed gz/xz/zip files. As long as your data itself is UTF8-compliant, then this will count it, as specified in Unicode 13.
Your locale settings don't matter here whatsoever, nor does your platform, OS version, awk version, or awk variant.
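For comparison, the same byte ranges can be written out as a plain C check. This is a sketch under assumptions, not a port of the awk functions used in the timings below: the names utf8_sequence_length and count_utf8_chars are hypothetical, and the choice to skip an ill-formed byte without counting it is mine and may not match what wc -m does with invalid input.

```c
#include <assert.h>
#include <stddef.h>

static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Length in bytes (1 to 4) of the well-formed UTF-8 sequence that starts
 * at s[0], using the same ranges as the regex, or 0 if there is none. */
size_t utf8_sequence_length(const unsigned char *s, size_t n)
{
    assert(s != NULL);
    if (n == 0)
        return 0;
    unsigned char b0 = s[0];
    if (b0 <= 0x7F)
        return 1;
    if (in_range(b0, 0xC2, 0xDF))
        return (n >= 2 && in_range(s[1], 0x80, 0xBF)) ? 2 : 0;
    if (in_range(b0, 0xE0, 0xEF)) {
        /* E0 excludes overlong forms, ED excludes surrogates */
        unsigned char lo = (b0 == 0xE0) ? 0xA0 : 0x80;
        unsigned char hi = (b0 == 0xED) ? 0x9F : 0xBF;
        return (n >= 3 && in_range(s[1], lo, hi) &&
                in_range(s[2], 0x80, 0xBF)) ? 3 : 0;
    }
    if (in_range(b0, 0xF0, 0xF4)) {
        /* F0 excludes overlong forms, F4 stops at U+10FFFF */
        unsigned char lo = (b0 == 0xF0) ? 0x90 : 0x80;
        unsigned char hi = (b0 == 0xF4) ? 0x8F : 0xBF;
        return (n >= 4 && in_range(s[1], lo, hi) &&
                in_range(s[2], 0x80, 0xBF) &&
                in_range(s[3], 0x80, 0xBF)) ? 4 : 0;
    }
    return 0;
}

/* Counts well-formed UTF-8 characters in s[0..n). An ill-formed byte is
 * skipped one byte at a time and not counted (an assumption of this
 * sketch). */
size_t count_utf8_chars(const unsigned char *s, size_t n)
{
    size_t chars = 0, i = 0;
    while (i < n) {
        size_t len = utf8_sequence_length(s + i, n - i);
        if (len == 0) {
            ++i;
            continue;
        }
        ++chars;
        i += len;
    }
    return chars;
}
```

On the 10 bytes of Hernández in UTF-8 this counts 9 characters. On the 9 bytes of the CP1252 version it counts 8, because the lone 0xE1 is not followed by continuation bytes.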
$ echo; time pvE0 < MV84/*BLITZE*webm | gwc -lcm
in0: 449MiB 0:00:10 [44.4MiB/s] [44.4MiB/s] [================================================>] 100%
1827289 250914815 471643928
real 0m10.188s
user 0m10.075s
sys 0m0.352s
$ echo; time pvE0 < MV84/*BLITZE*webm | mawk2x 'BEGIN { FS = "^$"} { bytes += lengthB0(); chars += lengthC0(); } END { print --NR, chars+NR, bytes+NR }'
in0: 449MiB 0:00:16 [27.0MiB/s] [27.0MiB/s] [================================================>] 100%
1827289=250914815=471643928
real 0m16.756s
user 0m16.621s
sys 0m0.449s
the file being tested is a 449 MB .webm
music video clip from youtube that's 3840x2160 VP9 + Opus
codecs. not too shabby for an interpreted scripting language to be this close to compiled C-binaries.
And it's only this slow for the hideously long regex to account for invalid bytes. If you're extremely sure your data is fully UTF8 compliant text, you can further optimize that regex so that mawk2 can go faster than both gnu-wc and bsd-wc :
$ brc; time pvE0 < "${m3t}" | awkwc4m
in0: 1.85GiB 0:00:14 [ 128MiB/s] [ 128MiB/s] [================================================>] 100%
12,494,275 lines 1,285,316,715 utf8 (349,725,658 uc) 1,891.656 MB ( 1983544693) /dev/stdin
real 0m14.753s <—- Custom Bash function that's entirely AWK
$ brc; time pvE0 < "${m3t}" |gwc -lcm
in0: 1.85GiB 0:00:28 [67.3MiB/s] [67.3MiB/s] [================================================>] 100%
12494275 1285316715 1983544693
real 0m28.165s <—— GNU WC
$ brc; time pvE0 < "${m3t}" |wc -lcm
in0: 1.85GiB 0:00:22 [85.5MiB/s] [85.5MiB/s] [================================================>] 100%
12494275 1285316715
real 0m22.181s <—— BSD WC
ps : "${m3t}" is a 1.85GB flat .txt file that's 12.5 million rows, and 13 fields each, filled to the brim with multibyte unicode characters (349.7 million of them).
gawk -e (in unicode mode) will complain about that regex. To circumvent that annoyance, use this regex which is the same as the one above, but expanded out to make gawk -e happy
Upvotes: 1
Reputation: 2841
Quote from the same wikipedia page above :
Fallback and auto-detection: Only a small subset of possible byte strings are a valid UTF-8 string: the bytes C0, C1, and F5 through FF cannot appear, and bytes with the high bit set must be in pairs, and other requirements.
in octal code that means xC0 = \300, xC1 = \301 and xF5 = \365 -> xFF = \377 being non-valid UTF-8.
Knowing that this space isn't valid UTF-8 is plenty useful in terms of wiggle room for one to insert custom delimiters inside any string :
pick any of those bytes, say \373, and once a quick if statement is used to verify it doesn't exist for that line, you can now perform custom text manipulation tricks of your choice, with a single-byte delimiter, even if it involves inserting them right in between the UTF8 bytes for a single code point, and it won't ruin the unicode at all. once you're done with the logic block, simply use a quick gsub( ) to remove all traces of it.
If that byte (\373, i.e. \xFB) exists, well, you're likely encountering either a binary file, or partially corrupted UTF8 text data.
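The "quick if statement" could look something like this in C. It is a minimal sketch: is_never_utf8_byte and delimiter_is_free are hypothetical names, and the byte set is just the C0, C1 and F5 through FF list from the quote above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 1 if b is one of the 13 bytes that never occur in well-formed UTF-8
 * (0xC0, 0xC1, 0xF5 through 0xFF), else 0. */
int is_never_utf8_byte(unsigned char b)
{
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}

/* 1 if delim is one of those 13 bytes and does not already occur in
 * line[0..n), so it can be used as a temporary single-byte delimiter. */
int delimiter_is_free(const unsigned char *line, size_t n, unsigned char delim)
{
    assert(line != NULL);
    if (!is_never_utf8_byte(delim))
        return 0;
    return memchr(line, delim, n) == NULL;
}
```

If delimiter_is_free returns 0 for a byte like 0xFB, the line already contains it, which is the binary-or-corrupted case described above.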
One use case, such as in my own modules, is a UTF-8 code-point-level-safe* substr( ) function. So instead of manually counting out the points 1 at a time, first use regex to count out max bytes of any code-point. Let's say 3-bytes (since 4-bytes ones are still rare in practice).
Then i apply 1 pad of \373 next to the 2-byte ones (I pad it to the left of [\302-\337]), and 2 pads of it, i.e. \373\373, next to ASCII ones, and voila, now all UTF8 code points have a fixed width, so a substr( ) becomes a mere multiplication exercise.
run a byte-level substr( ) on those start and end points, apply gsub(/[\373]+/, "", s) to throw away all the padding bytes, and now you have a usable* UTF-8-safe substr( ) function for all the variants of awk that aren't unicode-aware. This approach also works for multi-line records, and absolutely does not affect how FS and RS interact with the record.
(if u need 4-bytes, just pad more)
*i haven't incorporated any fancy logic to account for code-points that are post-decomposition components that supposedly grouped together as a single logical unit for string manipulation purposes.
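The padding idea can be sketched in C as follows. This is an illustration under assumptions rather than the awk module itself: pad_fixed_width and utf8_substr are hypothetical names, the width is fixed at 3 (so 4-byte code points are rejected), the start index is 0-based unlike awk's substr( ), and the input is assumed to be well-formed UTF-8 (continuation bytes are not validated).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAD   0xFB  /* \373, never present in well-formed UTF-8 */
#define WIDTH 3     /* assumes no 4-byte code points in the input */

/* Copies in[0..n) to out, left-padding every code point with PAD bytes
 * so that each one occupies exactly WIDTH bytes. Returns the number of
 * bytes written, or 0 if out is too small or a byte cannot start a
 * 1- to 3-byte sequence. */
size_t pad_fixed_width(const unsigned char *in, size_t n,
                       unsigned char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    size_t i = 0, o = 0;
    while (i < n) {
        size_t len = in[i] < 0x80 ? 1
                   : (in[i] & 0xE0) == 0xC0 ? 2
                   : (in[i] & 0xF0) == 0xE0 ? 3 : 0;
        if (len == 0 || i + len > n || o + WIDTH > cap)
            return 0;
        for (size_t p = len; p < WIDTH; ++p)
            out[o++] = PAD;              /* pads go to the left */
        memcpy(out + o, in + i, len);
        o += len;
        i += len;
    }
    return o;
}

/* Code-point substring of a padded buffer: byte offsets are just
 * code-point offsets times WIDTH. Pads are dropped while copying to dst,
 * which needs room for count * WIDTH bytes. Returns bytes written. */
size_t utf8_substr(const unsigned char *padded, size_t padded_len,
                   size_t start, size_t count, unsigned char *dst)
{
    assert(padded != NULL && dst != NULL);
    size_t total = padded_len / WIDTH;
    if (start > total)
        start = total;
    if (count > total - start)
        count = total - start;
    size_t o = 0;
    for (size_t i = start * WIDTH; i < (start + count) * WIDTH; ++i)
        if (padded[i] != PAD)
            dst[o++] = padded[i];
    return o;
}
```

For the four code points a, á, 잼, z (7 bytes), the padded form is 12 bytes, and taking 2 code points from index 1 gives back the 5 bytes C3 A1 EC 9E BC.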
Upvotes: 1
Reputation: 141165
You seem to be targeting UTF-8 specifically. Indeed, the first byte of a multibyte character in UTF-8 starts with 0b11xxxxxx and the next byte needs to be 0b10xxxxxx, where x represents any value (from wikipedia).
So you can detect such a sequence with sed by matching the hex ranges and exiting with a nonzero exit status if found:
LC_ALL=C sed -n '/[\xC0-\xFF][\x80-\xBF]/q1'
I.e. match bytes in ranges [0b11000000-0b11111111][0b10000000-0b10111111].
I think \x?? and the exit-code argument to q (as in q1) are both GNU extensions to sed.
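If GNU sed is not available, the same two-byte test is easy to write by hand. Here is a rough C equivalent of the pattern, with a hypothetical function name. Like the sed expression, it only looks at one lead byte and the byte after it, so CP1252 text such as E2 84 78 still matches.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf[0..n) contains a byte in 0xC0-0xFF immediately
 * followed by a byte in 0x80-0xBF, which is what the sed pattern
 * [\xC0-\xFF][\x80-\xBF] matches. Returns 0 otherwise. */
int has_lead_then_continuation(const unsigned char *buf, size_t n)
{
    assert(buf != NULL);
    for (size_t i = 0; i + 1 < n; ++i)
        if (buf[i] >= 0xC0 && (buf[i + 1] & 0xC0) == 0x80)
            return 1;
    return 0;
}
```

A caller would read the file in chunks, keep the last byte of each chunk so a pair split across two chunks is not missed, and exit with a nonzero status on the first hit.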
Upvotes: 5
Reputation: 28416
The best answer is imho actually the one with grep provided by Sundeep in the comment. You should try to get that working. The answer below makes use of sed in a similar way. I will probably delete it, as it really doesn't add anything to grep's solution.
What about this?
[[ -z "$(LANG=C sed -z '/[\x80-\xFF]/d' <(echo -e 'one\ntwo\nth⌫ree'))" ]]
echo $?
<(echo -e 'one\ntwo\nth⌫ree') is just an example file with a multibyte character in it
[[ -z string ]] returns 0 if the string has length zero, and 1 otherwise.
Upvotes: 1