coconut

Reputation: 1101

Count bytes in a field

I have a file that looks like this:

ASDFGHJ|ASDFEW|ASFEWFEWAFEWASDFWE FEWFDWAEWA FEWDWDFEW|EWFEW|ASKOKJE
IOJIKNH|ASFDFEFW|ASKDFJEO JEWIOFJS IEWOFJEO SJFIEWOF WE|WEFEW|ASFEWAS

I'm having trouble with this file because it's written in Cyrillic and the database complains about the number of bytes (as opposed to the number of characters). I want to check whether, for example, the first field is larger than 10 bytes, the second field is larger than 30 bytes, and so on.

I've been trying a lot of different things: awk, wc... I know that with wc -c I can count bytes, but how can I retrieve only the lines that have a field larger than X?

Any idea?

Upvotes: 1

Views: 719

Answers (3)

Tom Fenech

Reputation: 74685

Here's a Perl one-liner that prints the whole line if any field is longer, in bytes, than the corresponding element of the array @m:

perl -F'\|' -Mbytes -lane '@m=(10,10,30,10); print if grep { bytes::length $_ > shift @m } @F' file

As the name suggests, bytes::length ignores the encoding and returns the length of each field in bytes. The -a switch to Perl enables auto-split mode, which creates an array @F containing all the fields. I've used the pipe | as the delimiter (it needs escaping with a backslash). The -l switch removes the newline from the end of the line, ensuring that your final field is the correct length.

The -n switch tells Perl to loop through each line in the file. grep filters the array @F on the condition in the block. I'm using shift to remove and return the first element of @m, so that each field in @F is being compared with the respective element in @m. The filtered list will evaluate to true in this context if it contains any elements (i.e. if any of the fields were longer than their limit).
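
For reference, here is roughly what that one-liner looks like when expanded into a standalone script. This is only a sketch: the limits in @limits are the placeholder values from the one-liner, check.pl is just an example name, and the // 0 merely silences the warning you would otherwise get when a line has more fields than limits:

#!/usr/bin/perl
use strict;
use warnings;
use bytes;                           # length() now counts bytes, not characters

my @limits = (10, 10, 30, 10);       # per-field byte limits (placeholder values)

while (my $line = <>) {              # -n: loop over the input lines
    chomp $line;                     # -l: strip the trailing newline
    my @F = split /\|/, $line;       # -a with -F'\|': split each line on |

    my @m = @limits;                 # fresh copy; the one-liner rebuilds @m on every line
    print "$line\n"
        if grep { bytes::length($_) > (shift(@m) // 0) } @F;
}

Run it as perl check.pl file.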

Upvotes: 3

hek2mgl

Reputation: 158100

To obtain the number of bytes in a certain FIELD on a certain LINE, you can issue the following awk command:

awk -F'|' -v LINE=1 -v FIELD=3 'NR==LINE{print $FIELD}' input.txt | wc -c
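
Note that print appends a newline which wc -c counts as well, so the result is one byte higher than the field itself. Here is a sketch of a variant that skips the pipe to wc entirely: with LC_ALL=C, awk's length() counts bytes rather than characters (this assumes a locale-aware awk such as GNU awk).

# Same LINE/FIELD selection as above, but let awk report the length itself;
# LC_ALL=C makes length() count bytes instead of characters.
LC_ALL=C awk -F'|' -v LINE=1 -v FIELD=3 'NR==LINE{print length($FIELD)}' input.txt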

To print the number of bytes for every field, you can use a little loop:

awk -F'|' '{for(i=1;i<=NF;i++)print $i}' a.txt | \
while IFS= read -r field ; do
    # printf avoids the extra newline that a here-string would add to the count
    nb=$(printf '%s' "$field" | wc -c)
    echo "$field $nb"

    # Check if the field is too long
    if [ "$nb" -gt 40 ] ; then
        echo "field $field is too long"
        exit 1
    fi
done
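
The shell round trip can also be avoided entirely. Here is a sketch of the same check done in awk alone; the 40-byte limit and the exit behaviour mirror the loop above, and LC_ALL=C again makes length() count bytes:

LC_ALL=C awk -F'|' '{
    for (i = 1; i <= NF; i++) {
        print $i, length($i)                  # field and its byte count
        if (length($i) > 40) {
            print "field " $i " is too long"
            exit 1
        }
    }
}' a.txt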

Upvotes: 1

jaypal singh

Reputation: 77145

If you are open to using Perl, then this could help. I have added comments to make it easier for you to follow:

#!/usr/bin/perl

use strict;
use warnings;
use bytes;

## Change 'file' to the path where your file is located
open my $data, '<', 'file' or die "Unable to open 'file': $!";

## Define an array with the acceptable size (in bytes) for each field
my @size = qw( 10 30 ... );        

LINE: while(<$data>) {         ## Read one line at a time      
    chomp;                     ## Remove the newline from each line read

    ## Split the line on | and store the fields in an array
    my @fields = split /\|/;   

    for ( 0 .. $#fields ) {    ## Iterate over the field indexes

        ## If this field is not larger than its limit, move on to the next line
        next LINE unless bytes::length($fields[$_]) > $size[$_];  
    }

    ## If every field exceeded its limit, print the line
    print "$_\n";  
}
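
Save the script under any name (check_fields.pl below is only an example) and run it directly; the input path is hard-coded as 'file' in the open() call:

perl check_fields.pl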

Upvotes: 3
