Reputation: 11

awk optimised for multi delimeter on single record

I have a file with more than 2 bilion record in it.

It contains record with multi seperator as , OBO- OBI- CSP- and TICK

My lines in file B2.txt are:

917354000000,SUD=CAT-10&DBSG-1&BS3G-1&TS11-1&TS21-1&TS22-1&RSA-1&CSP-69&NAM-0&SCHAR-4&PWD-0000&OFA-0&OICK-134&HOLD-1&MPTY-1&CLIP-1&CFU-1&CFB-1&CFNRY-1&CFNRC-1&CAW-1&SOCFU-1&SOCFB-1&SOCFRY-1&SOCFRC-1&SODCF-0&SOSDCF-4&SOCB-0&SOCLIP-0&SOCLIR-0&SOCOLP-0;

917354000004,SUD=CAT-10&DBSG-1&OBO-2&OBR-2&BS3G-1&TS11-1&TS21-1&TS22-1&RSA-4&PRBT-1&NAM-0&SCHAR-8&PWD-0000&OFA-6&HOLD-1&CLIP-1&CFU-1&CFB-1&CFNRY-1&CFNRC-1&CAW-1&SOCFU-0&SOCFB-0&SOCFRY-0&SOCFRC-0&SODCF-0&SOSDCF-4&SOCB-0&SOCLIP-0&SOCLIR-0&SOCOLP-0;

My code is taking almost more than 2 days to run... Below is the code.

#!/usr/bin/sh

echo "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value" > tt3.txt

while read i
do
MSISDN=`echo $i | awk -F"," '{ print $1}'`;
Complete_Info=`echo $i | awk -F"," '{ print $2}'`;
OBO_Value=`echo $Complete_Info | awk -F"OBO-" '{ print $2 }'| awk -F"&"
'{ print $1 }'`
OBI_Value=`echo $Complete_Info | awk -F"OBI-" '{ print $2 }'| awk -F"&"
'{ print $1 }'`
CSP_Value=`echo $Complete_Info | awk -F"CSP-" '{ print $2 }'| awk -F"&"
'{ print $1 }'`
TICK_Value=`echo $Complete_Info | awk -F"TICK-" '{ print $2 }'| awk -F"&"
'{ print $1 }'`

echo $MSISDN,$OBO_Value,$OBI_Value,$TICK_Value,$CSP_Value >> tt3.txt;

done < B2.txt

Is it possible to optimise this code with awk so the output file contains as follows 917354000000,,69,

Upvotes: 1

Answers (5)

user3442743

Reputation:

Another awk way

awk 'BEGIN{OFS=FS=",";print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"}
     NF{match($2,/CSP-[1-9]+/);a=substr($2,RSTART+4,RLENGTH-4)
     match($2,/TICK-[1-9]+/);b=substr($2,RSTART+5,RLENGTH-5)
     match($2,/OBI-[1-9]+/);c=substr($2,RSTART+4,RLENGTH-4)
     match($2,/OBO-[1-9]+/);d=substr($2,RSTART+4,RLENGTH-4)
     print $1,d,c,b,a}
' file

Produces

MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value
917354000000,,,,69
917354000004,2,,,

I think its pretty self explanatory but if you need anything explaining just ask.

EDIT:

Here it is using a function

awk 'BEGIN{OFS=FS=",";print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"}
     function f(g){match($2,g"-[1-9]*");return (substr($2,RSTART+4,RLENGTH-4))}
     NF{a=f("OBO");b=f("OBI");c=f("TICK");d=f("CSP");print $1,a,b,c,d} ' file

Bit neater

awk 'BEGIN{
         OFS=FS=","
         print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
     }

     function GetVal(Str){
         match($2,Str"-[1-9]*")
         return (substr($2,RSTART+4,RLENGTH-4))
     }

     NF{
         a=GetVal("OBO")
         b=GetVal("OBI")
         c=GetVal("TICK")
         d=GetVal("CSP")
         print $1,a,b,c,d} ' file

Decided to check the speed of each of the scripts on here for 10 000 rows

 Mine(functions) -        real 0m0.773s user 0m0.755s sys 0m0.016s 
 Mine(non-funct) -        real 0m0.306s user 0m0.295s sys 0m0.009s
 Scrutinizer -            real 0m0.400s user 0m0.392s sys 0m0.008s 
 Martin -                 real 0m0.298s user 0m0.291s sys 0m0.006s

The function one is significantly slower.

Upvotes: 3

Ed Morton

Reputation: 203995

I used the same vars as Scrutinizer so it's easy to see the differences with this similar approach that doesn't need an extra array and doesn't call split() on every field:

$ cat tst.awk
BEGIN{
    FS="[-&,]"
    OFS=","
    print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
{
    delete A            # or split("",A) in non-gawk
    for (i=2; i<NF;i+=2)
        A[$i] = $(i+1)
    print $1,A["OBO"],A["OBI"],A["TICK"],A["CSP"]
}

$ awk -f tst.awk file
MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value
917354000000,,,,69
917354000004,2,,,

Since @Jidders solution with mawk seems to be fastest, I wondered how this would compare:

BEGIN{OFS=FS=","; print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"}
NF {
    a=b=c=d=$2
    sub(/.*CSP-/,"",a);  sub(/[^0-9].*/,"",a)
    sub(/.*TICK-/,"",b); sub(/[^0-9].*/,"",b)
    sub(/.*OBI-/,"",c);  sub(/[^0-9].*/,"",c)
    sub(/.*OBO-/,"",d);  sub(/[^0-9].*/,"",d)
    print $1,d,c,b,a
}

Similar approach but just using 2 subs()s instead of match() + substr(). The result is that that's significantly slower than my original attempt and Jidders:

$ time awk -f ed.awk file10k > ed.out
real    0m0.468s
user    0m0.405s
sys     0m0.030s

$ time awk -f ed2.awk file10k > ed2.out
real    0m1.092s
user    0m1.045s
sys     0m0.046s

$ time awk -f jidder.awk file10k > jidder.out
real    0m0.218s
user    0m0.124s
sys     0m0.061s

I guess mawk must have some serious optimization for match()+substr()!

Ah, I just realized what the difference is - string manipulation is notoriously slow in awk (slower than I/O) and the above 2-sub()s solution modifies each string variable twice.

Upvotes: 2

Scrutinizer

Reputation: 9936

Another awk:

awk '
  BEGIN {
    FS="[,&;]"
    OFS=","
    print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
  }

  NF { 
    split(x,A)                # a way of clearing array A, gawk can use delete A
    for(i=2; i<=NF; i++) {
      split ($i,F,/-/)
      A[F[1]]=F[2]
    }
    print $1,A["OBO"],A["OBI"],A["TICK"],A["CSP"]
  }
' B2.txt > tt3.txt

-------

After all the test results I decided to test with 1M records of which 25% was an empty line. I tested on OSX 10.9 with the following awk versions:

awk (BSD awk) v. 20070501)
mawk v. 1.3.4 20100625
gawk 4.0.2

I left out gawk 3 results, since I know from experience that they are usually disappointing and it usually finishes dead last..

Martin's solution was faster with BSD awk and GNU awk, but the opposite was true with mawk, where my solution was 25% faster. Ed's improvement by not using a split() function further improved speed by some 30% and with mawk it took only 7.225s, half of the time that was needed for Martin's solution.

But the real kicker turned out to be Jidders first solution with the match() and without the use of a function. With mawk it did it in a whopping 1.868s

So YMMV, but that best solution speed wise appears to be Jidder's first solution in combination with mawk..

RESULTS:

       Martin  Scrutinizer  Ed Morton  Jidder (non-function version)
BSDawk
real  0m25.008s  1m51.424s  0m38.566s  0m17.945s
user  0m24.545s  1m47.791s  0m37.662s  0m17.485s
sys   0m0.117s   0m0.824s   0m0.120s   0m0.117s

mawk
real  0m14.472s  0m11.618s  0m7.225s   0m1.868s
user  0m13.922s  0m11.091s  0m6.988s   0m1.759s
sys   0m0.117s   0m0.116s   0m0.093s   0m0.084s

gawk
real  0m33.486s  1m16.490s  0m30.642s  0m17.201s
user  0m32.816s  1m14.874s  0m30.041s  0m16.689s
sys   0m0.134s   0m0.219s   0m0.116s   0m0.131s

Upvotes: 4

martin

Reputation: 3249

Here is an awk script that worked for your input:

BEGIN { 
   OFS=","
   FS=",|&|;|-"
   print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}

{ 
   obi=""
   tick=""
   csp=""
   obo=""

   for (i=4; i<=NF; i+=2) {
     if( $i == "OBO" ) { 
       obo=$(i+1) 
     } else if ($i == "OBI") {
       obi=$(i+1)
     } else if ($i == "CSP") { 
       csp=$(i+1)
     } else if ($i == "TICK") { 
       tick=$(i+1)
     }
   } 

   print $1, obo, obi, tick, csp
}

gave

MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value
917354000000,,,,69
917354000004,2,,,

I took advantage of the fact, that your data and input seems to be alternating every two steps.

For completeness let me mention my benchmark for 10 million rows with all solutions updated:

mawk                  gawk
Ed Morton
real    1m19.259s     real  3m36.107s
user    1m17.987s     user  3m35.163s
sys      0m0.706s     sys   0m0.936s 

Martin  
real    2m13.875s     real  4m37.112s
user    2m12.680s     user  4m36.032s
sys     0m0.848s      sys   0m0.954s 

Scrutinizer
real    1m48.755s     real  6m40.202s
user    1m47.513s     user  6m39.148s
sys      0m0.894s     sys   0m1.041s 

Jidder (non-function version)
real    0m16.403s     real  3m18.342s
user    0m15.626s     user  3m17.004s
sys     0m0.632s      sys   0m0.968s

Clearly: use Jidders solution with mawk, this saves a lot of time.

Upvotes: 2

Jotne

Reputation: 41460

You can simplify it some like this:

OBO_Value=$(awk -F"OBO-" '/OBO-/ {split($2,a,"&");print a[1];exit}' <<< $Complete_Info)

This do all in one go, so should be some faster. Also the exit makes the awk stop if line with value is found.

PS use parentheses instead of old and outdated back-tics

Upvotes: 0

awk optimised for multi delimeter on single record

Answers (5)

-------

RESULTS:

Related Questions