Reputation: 117

Help with awk to truncate and pad

I have a long list of semicolon-delimited Unicode values. Here's an example:

E0027;TAG APOSTROPHE;Cf;0;BN;;;;;N;;;;;

All I need is the "E0027;" part.

So I first need to drop everything in the line after the first semicolon. In some cases the semicolon comes after 4 hex digits; in others (as above) it comes after 5. If the position were the same throughout, I'd just truncate after a fixed number of characters. I've found lots of examples of various manipulations with awk, but no regular expression that seems to fit this particular case. Does anyone know the proper syntax? The logic is simply to keep everything before the first semicolon and drop everything after it.

Then, for the resulting file, I need to add a leading 0 to the line if the number is only 4 chars. So for example:

8A9B;

Should become:

08A9B;

But the 5-digit values (such as the first example) should remain as is, with no leading zero.

(Though would an extra leading zero make a difference if I'm using these values in HTML? Would it matter if I had:

&#x0E0027

Instead of:

&#xE0027

If these will be parsed identically by PHP and won't make a difference, I guess the latter part isn't so important (though thousands of extra zeros would bloat the size of the code).)

Thank you for any help in advance!

Upvotes: 1

Answers (5)

shellter

Reputation: 37298

Edit: Awk code fixed to leave last ';' in place.

print -- "E0027;TAG APOSTROPHE;Cf;0;BN;;;;;N;;;;;
0027;TAG APOSTROPHE;Cf;0;BN;;;;;N;;;;;" \
| awk '{
        #dbg print "$0=" $0
        sub(/;.*$/, ";")  # fixed here
        len=length($0)
        if (len == 5) {print "0" $0} # this was 4, now 5 with ';'
        else if (len == 6) {print $0} # 5 changed to 6
        else {print "error in input: found len=" len " in XX" $0 "xx"}
}'

You can replace the print -- "..." | part with cat file |, or avoid a UUOC (Useless Use of Cat) award by removing print -- "..." | and appending inFileName > outFileName after the closing ' of the awk program.
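To make the file-based form concrete, here is a minimal sketch of the same awk program wrapped in a shell function. The name pad_codepoints is only an illustrative choice, and inFileName / outFileName are placeholders, not real files.

```shell
# Hypothetical helper: same logic as the awk program above.
# Reads the files given as arguments, or stdin if there are none.
pad_codepoints() {
    awk '{
        sub(/;.*$/, ";")                    # keep only the text up to the first ";"
        len = length($0)
        if (len == 5)      print "0" $0     # 4 chars + ";" -> add a leading zero
        else if (len == 6) print $0         # 5 chars + ";" -> leave as is
        else print "error in input: found len=" len " in XX" $0 "xx"
    }' "$@"
}

# Usage (file names are placeholders):
#   pad_codepoints inFileName > outFileName
```

Passing the file name straight to awk is what avoids the extra cat process.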

I don't know the answer to your HTML question.

Upvotes: 1

Dr. belisarius

Reputation: 61046

BEGIN {FS=";"}

{print substr("0000" $1 FS, length($1),6)}

Input:

E0027;TAG APOSTROPHE;Cf;0;BN;;;;;N;;;;;
8A9B;TAG APOSTROPHE;Cf;0;BN;;;;;N;;;;;

Out:

E0027;
08A9B;

Running at ideone.

Upvotes: 0

Decent Dabbler

Reputation: 22773

I'm no *nix man, so I'm not really familiar with awk. However, if a PHP solution is acceptable, how about this:

$values = array();
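// FILE_SKIP_EMPTY_LINES only takes effect together with FILE_IGNORE_NEW_LINES
$lines = file( '/path/to/file', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );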
foreach( $lines as $line )
{
    // get part before first occurrence of ;
    $value = strstr( $line, ';', true );
    // pad the value, if applicable
    $value = str_pad( $value, 5, '0', STR_PAD_LEFT );
    // put it in the result array
    $values[] = $value;
}

And if reading the entire file into memory at once is unacceptable, you could of course read it line by line with fopen(), fgets(), etc.

Upvotes: 0

kurumi

Reputation: 25609

$ echo "E0027;TAG APOSTROPHE;Cf;0;BN;;;;;N;;;;;" | awk -F";" '{ printf "%05s\n",$1 }'
E0027
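One caveat with the above: the 0 flag combined with %s is not guaranteed to zero-pad (the gawk manual describes the 0 flag as applying only to numeric formats), so on some awk implementations a 4-character value may come out space-padded. The output also drops the trailing semicolon the question asked to keep. A more portable sketch builds the padding explicitly; the function name pad5 is hypothetical, and values longer than 5 characters pass through unchanged:

```shell
# Hypothetical helper: zero-pad the first ";"-delimited field to 5 characters.
pad5() {
    awk -F';' '$1 != "" {
        v = $1
        while (length(v) < 5) v = "0" v   # prepend zeros until 5 characters wide
        print v FS                        # FS is ";", restoring the trailing semicolon
    }' "$@"
}

echo "8A9B;TAG APOSTROPHE;Cf;0;BN;;;;;N;;;;;" | pad5   # prints 08A9B;
```

Lines whose first field is empty are skipped rather than turned into 00000;.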

Upvotes: 0

SiegeX

Reputation: 140437

awk -F';' '$0=length($1)<5?"0" $1 FS:$1 FS'

Proof of Concept

$ echo "8A9B;TAG APOSTROPHE;Cf;0;BN;;;;;N;;;;;" | awk -F';' '$0=length($1)<5?"0" $1 FS:$1 FS'
08A9B;

$ echo "E0027;TAG APOSTROPHE;Cf;0;BN;;;;;N;;;;;" | awk -F';' '$0=length($1)<5?"0" $1 FS:$1 FS'
E0027;

Upvotes: 2
