Reputation: 11
I have a file with more than 2 billion records in it.
Each record uses multiple separators, such as `,`, OBO-, OBI-, CSP- and TICK-.
My lines in file B2.txt are:
917354000000,SUD=CAT-10&DBSG-1&BS3G-1&TS11-1&TS21-1&TS22-1&RSA-1&CSP-69&NAM-0&SCHAR-4&PWD-0000&OFA-0&OICK-134&HOLD-1&MPTY-1&CLIP-1&CFU-1&CFB-1&CFNRY-1&CFNRC-1&CAW-1&SOCFU-1&SOCFB-1&SOCFRY-1&SOCFRC-1&SODCF-0&SOSDCF-4&SOCB-0&SOCLIP-0&SOCLIR-0&SOCOLP-0;
917354000004,SUD=CAT-10&DBSG-1&OBO-2&OBR-2&BS3G-1&TS11-1&TS21-1&TS22-1&RSA-4&PRBT-1&NAM-0&SCHAR-8&PWD-0000&OFA-6&HOLD-1&CLIP-1&CFU-1&CFB-1&CFNRY-1&CFNRC-1&CAW-1&SOCFU-0&SOCFB-0&SOCFRY-0&SOCFRC-0&SODCF-0&SOSDCF-4&SOCB-0&SOCLIP-0&SOCLIR-0&SOCOLP-0;
My code takes more than 2 days to run. Below is the code:
#!/usr/bin/sh
echo "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value" > tt3.txt
while read i
do
MSISDN=`echo $i | awk -F"," '{ print $1 }'`
Complete_Info=`echo $i | awk -F"," '{ print $2 }'`
OBO_Value=`echo $Complete_Info | awk -F"OBO-" '{ print $2 }' | awk -F"&" '{ print $1 }'`
OBI_Value=`echo $Complete_Info | awk -F"OBI-" '{ print $2 }' | awk -F"&" '{ print $1 }'`
CSP_Value=`echo $Complete_Info | awk -F"CSP-" '{ print $2 }' | awk -F"&" '{ print $1 }'`
TICK_Value=`echo $Complete_Info | awk -F"TICK-" '{ print $2 }' | awk -F"&" '{ print $1 }'`
echo $MSISDN,$OBO_Value,$OBI_Value,$TICK_Value,$CSP_Value >> tt3.txt
done < B2.txt
Is it possible to optimise this code with awk so that the output file contains lines like 917354000000,,69, ?
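For scale: each iteration of the loop above forks more than a dozen processes (an `echo` plus one or two `awk`s per variable), and with 2 billion lines that process-creation cost, not awk itself, is what takes days. A small sketch with a hypothetical 1,000-line sample shows the gap between the per-line pipelines and a single awk pass:

```shell
# Hypothetical 1,000-line sample to compare the two approaches.
seq 1000 | awk '{print $1",SUD=CAT-10&CSP-69;"}' > sample.txt

# One echo|awk pipeline per line (the pattern from the loop above):
time while read i; do echo "$i" | awk -F, '{print $1}'; done < sample.txt > /dev/null

# One awk process reading the whole file:
time awk -F, '{print $1}' sample.txt > /dev/null

rm -f sample.txt
```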
Upvotes: 1
Views: 131
Reputation:
Another awk way
awk 'BEGIN{OFS=FS=",";print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"}
NF{match($2,/CSP-[0-9]+/);a=substr($2,RSTART+4,RLENGTH-4)
match($2,/TICK-[0-9]+/);b=substr($2,RSTART+5,RLENGTH-5)
match($2,/OBI-[0-9]+/);c=substr($2,RSTART+4,RLENGTH-4)
match($2,/OBO-[0-9]+/);d=substr($2,RSTART+4,RLENGTH-4)
print $1,d,c,b,a}
' file
Produces
MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value
917354000000,,,,69
917354000004,2,,,
I think it's pretty self-explanatory, but if you need anything explained, just ask.
EDIT:
Here it is using a function
awk 'BEGIN{OFS=FS=",";print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"}
function f(g){match($2,g"-[0-9]*");return substr($2,RSTART+length(g)+1,RLENGTH-length(g)-1)}
NF{a=f("OBO");b=f("OBI");c=f("TICK");d=f("CSP");print $1,a,b,c,d} ' file
Bit neater
awk 'BEGIN{
OFS=FS=","
print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
function GetVal(Str){
match($2,Str"-[0-9]*")
return substr($2,RSTART+length(Str)+1,RLENGTH-length(Str)-1)
}
NF{
a=GetVal("OBO")
b=GetVal("OBI")
c=GetVal("TICK")
d=GetVal("CSP")
print $1,a,b,c,d} ' file
I decided to check the speed of each of the scripts on here for 10,000 rows:
Mine(functions) - real 0m0.773s user 0m0.755s sys 0m0.016s
Mine(non-funct) - real 0m0.306s user 0m0.295s sys 0m0.009s
Scrutinizer - real 0m0.400s user 0m0.392s sys 0m0.008s
Martin - real 0m0.298s user 0m0.291s sys 0m0.006s
The function one is significantly slower.
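For reproducibility, here is a sketch of how such a 10,000-row benchmark file could be generated; the actual test data isn't shown in the answers, so the line format below is a guessed, shortened version of B2.txt:

```shell
# Hypothetical generator for a 10,000-row test file in the B2.txt format.
awk 'BEGIN{
  for (n = 0; n < 10000; n++)
    printf "9173540%05d,SUD=CAT-10&DBSG-1&OBO-2&CSP-69&NAM-0&CLIP-1;\n", n
}' > file10k

wc -l file10k
```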
Upvotes: 3
Reputation: 203995
I used the same variable names as Scrutinizer so it's easy to see the differences. This similar approach doesn't need an extra array and doesn't call split() on every field:
$ cat tst.awk
BEGIN{
FS="[-&,]"
OFS=","
print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
{
delete A # or split("",A) in non-gawk
for (i=2; i<NF;i+=2)
A[$i] = $(i+1)
print $1,A["OBO"],A["OBI"],A["TICK"],A["CSP"]
}
$ awk -f tst.awk file
MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value
917354000000,,,,69
917354000004,2,,,
Since @Jidder's solution with mawk seems to be fastest, I wondered how this would compare:
BEGIN{OFS=FS=","; print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"}
NF {
a=b=c=d=$2
sub(/.*CSP-/,"",a); sub(/[^0-9].*/,"",a)
sub(/.*TICK-/,"",b); sub(/[^0-9].*/,"",b)
sub(/.*OBI-/,"",c); sub(/[^0-9].*/,"",c)
sub(/.*OBO-/,"",d); sub(/[^0-9].*/,"",d)
print $1,d,c,b,a
}
A similar approach, but using two sub()s instead of match() + substr(). The result is that it's significantly slower than both my original attempt and Jidder's:
$ time awk -f ed.awk file10k > ed.out
real 0m0.468s
user 0m0.405s
sys 0m0.030s
$ time awk -f ed2.awk file10k > ed2.out
real 0m1.092s
user 0m1.045s
sys 0m0.046s
$ time awk -f jidder.awk file10k > jidder.out
real 0m0.218s
user 0m0.124s
sys 0m0.061s
I guess mawk must have some serious optimization for match()+substr()!
Ah, I just realized what the difference is: string manipulation is notoriously slow in awk (slower than I/O), and the above two-sub() solution modifies each string variable twice.
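The contrast can be seen in isolation (a minimal sketch with a hypothetical shortened record): match() + substr() only reads the string, while each sub() call rewrites it, so the second variant performs two rewrites per extracted field.

```shell
s='SUD=CAT-10&OBO-2&CSP-69;'

# Read-only extraction: match() sets RSTART/RLENGTH, substr() copies out the value.
awk -v s="$s" 'BEGIN{ match(s, /CSP-[0-9]+/); print substr(s, RSTART+4, RLENGTH-4) }'
# prints: 69

# Mutating extraction: two sub() calls, each rewriting the variable.
awk -v s="$s" 'BEGIN{ a=s; sub(/.*CSP-/, "", a); sub(/[^0-9].*/, "", a); print a }'
# prints: 69
```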
Upvotes: 2
Reputation: 9936
Another awk:
awk '
BEGIN {
FS="[,&;]"
OFS=","
print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
NF {
split(x,A) # a way of clearing array A, gawk can use delete A
for(i=2; i<=NF; i++) {
split ($i,F,/-/)
A[F[1]]=F[2]
}
print $1,A["OBO"],A["OBI"],A["TICK"],A["CSP"]
}
' B2.txt > tt3.txt
After all the test results I decided to test with 1M records, of which 25% were empty lines. I tested on OS X 10.9 with the awk versions shown below (BSD awk, mawk, and gawk).
I left out the gawk 3 results, since I know from experience that they are usually disappointing; it usually finishes dead last.
Martin's solution was faster with BSD awk and GNU awk, but the opposite was true with mawk, where my solution was 25% faster. Ed's improvement of not using a split() function improved speed by a further 30% or so, and with mawk it took only 7.225s, half of the time needed for Martin's solution.
But the real kicker turned out to be Jidder's first solution, with match() and without the use of a function. With mawk it did it in a whopping 1.868s.
So YMMV, but the best solution speed-wise appears to be Jidder's first solution in combination with mawk.
S.
           Martin       Scrutinizer   Ed Morton    Jidder (non-function)
BSD awk
  real     0m25.008s    1m51.424s     0m38.566s    0m17.945s
  user     0m24.545s    1m47.791s     0m37.662s    0m17.485s
  sys      0m0.117s     0m0.824s      0m0.120s     0m0.117s
mawk
  real     0m14.472s    0m11.618s     0m7.225s     0m1.868s
  user     0m13.922s    0m11.091s     0m6.988s     0m1.759s
  sys      0m0.117s     0m0.116s      0m0.093s     0m0.084s
gawk
  real     0m33.486s    1m16.490s     0m30.642s    0m17.201s
  user     0m32.816s    1m14.874s     0m30.041s    0m16.689s
  sys      0m0.134s     0m0.219s      0m0.116s     0m0.131s
Upvotes: 4
Reputation: 3249
Here is an awk script that worked for your input:
BEGIN {
OFS=","
FS=",|&|;|-"
print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
{
obi=""
tick=""
csp=""
obo=""
for (i=4; i<=NF; i+=2) {
if( $i == "OBO" ) {
obo=$(i+1)
} else if ($i == "OBI") {
obi=$(i+1)
} else if ($i == "CSP") {
csp=$(i+1)
} else if ($i == "TICK") {
tick=$(i+1)
}
}
print $1, obo, obi, tick, csp
}
gave
MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value
917354000000,,,,69
917354000004,2,,,
I took advantage of the fact that your data alternates between key and value every two fields.
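To make that alternation concrete, here is how that FS splits one shortened, hypothetical input line; the keys land in the even-numbered fields from $4 onward, each immediately followed by its value:

```shell
# Field layout under FS=",|&|;|-" for a shortened sample line.
echo '917354000004,SUD=CAT-10&DBSG-1&OBO-2&CSP-69;' |
awk -F',|&|;|-' '{ for (i = 1; i <= 8; i++) print i, $i }'
# prints:
# 1 917354000004
# 2 SUD=CAT
# 3 10
# 4 DBSG
# 5 1
# 6 OBO
# 7 2
# 8 CSP
```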
For completeness let me mention my benchmark for 10 million rows with all solutions updated:
                                  mawk         gawk
Ed Morton
  real                        1m19.259s    3m36.107s
  user                        1m17.987s    3m35.163s
  sys                         0m0.706s     0m0.936s
Martin
  real                        2m13.875s    4m37.112s
  user                        2m12.680s    4m36.032s
  sys                         0m0.848s     0m0.954s
Scrutinizer
  real                        1m48.755s    6m40.202s
  user                        1m47.513s    6m39.148s
  sys                         0m0.894s     0m1.041s
Jidder (non-function version)
  real                        0m16.403s    3m18.342s
  user                        0m15.626s    3m17.004s
  sys                         0m0.632s     0m0.968s
Clearly: use Jidder's solution with mawk; it saves a lot of time.
Upvotes: 2
Reputation: 41460
You can simplify it somewhat, like this:
OBO_Value=$(awk -F"OBO-" '/OBO-/ {split($2,a,"&");print a[1];exit}' <<< $Complete_Info)
This does it all in one go, so it should be somewhat faster. The exit also makes awk stop as soon as the line with the value is found.
PS: use $(...) (parentheses) instead of old and outdated backticks.
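A usage sketch in context, with a hypothetical $Complete_Info value for illustration; note that the <<< here-string requires bash/ksh/zsh, not the plain sh from the question's shebang:

```shell
#!/bin/bash
# Hypothetical sample value, standing in for one parsed line of B2.txt.
Complete_Info='SUD=CAT-10&OBO-2&CSP-69;'

# Extract the OBO value in a single awk call; exit stops awk immediately.
OBO_Value=$(awk -F"OBO-" '/OBO-/ {split($2,a,"&");print a[1];exit}' <<< "$Complete_Info")

echo "$OBO_Value"    # prints: 2
```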
Upvotes: 0