Aman Singh

Reputation: 75

Optimise Bash scripting for large data

I have written a bash script that tries to build a new file from two input files.

File1:

1000846364118,9,369,9901,0,2020.05.20 13:20:52,2020.07.14 16:38:11,2021.03.14 00:00:00,U,2020.07.14 16:38:11
1000683648398,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,U,2020.01.01 01:25:05
1000534726081,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,X,2020.01.01 01:25:05

File2:

1000846364118;0;;2021.04.04;9914;100084636;ISATD;U;TEST;1234567890;2;;0;0;0;0;2020.10.12.00:00:00;0;0
1000830686890;0;;2021.03.02;9807;100083068;ISATD;U;TEST;1234567891;2;;0;0;0;0;2020.10.12.00:00:01;0;0
1000835819335;0;;2021.03.21;9990;100083581;ISATD;U;TEST;1234567892;2;;0;0;0;0;2020.10.12.00:00:03;0;0
1000683648398;0;;2020.10.31;9829;100068364;ISATD;U;TEST;1234567893;2;;0;0;0;0;2020.10.12.00:00:06;0;0

The new file should contain only those rows of file1 that have the pattern 'U' in them, with an extra column appended that holds the 10th field (123456789X) of the matching row in file2. So my final output will look like this:

1000846364118,9,369,9901,0,2020.05.20 13:20:52,2020.07.14 16:38:11,2021.03.14 00:00:00,U,2020.07.14 16:38:11,1234567890
1000683648398,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,U,2020.01.01 01:25:05,1234567893

My script is below and it works fine, but the data I am playing with is huge and generating the output file takes too much time. I put a timestamp after every step and found that the for loop portion takes hours to generate a few KB of data, even though I am only working with a few hundred MB of input. I need help optimising it.

cat /dev/null > new_file

# Extract the first field (serial number) of every line in file1 containing 'U'
used_Serial_Number=`grep U file1 | awk -F "," '{print $1}'`

echo "Serial no extracted  at `date`"  # Till this portion is getting completed in 2-3mins

# For each serial number, scan file2 for its 10th field (msisdn),
# then scan file1 again and append that value to the matching line
for i in $used_Serial_Number; do
    msisdn=`grep $i file2 | awk -F ";" '{print $10}'`
    grep $i file1 | awk -v msisdn=$msisdn -F "," 'BEGIN { OFS = "," } { print $0, msisdn }' >> new_file
done

Upvotes: 1

Views: 54

Answers (1)

RavinderSingh13

Reputation: 133458

Could you please try the following, written and tested with the shown samples in GNU awk. In case the 9th field of Input_file1 could be u OR U, change $9=="U" to tolower($9)=="u" to match both cases; a sketch of that variant is shown right after the script below.

awk '
BEGIN{
  FS=";"
  OFS=","
}
FNR==NR{
  a[$1]=$10
  next
}
($1 in a) && $9=="U"{
  print $0,a[$1]
}
' Input_file2 FS="," Input_file1

Explanation: adding a detailed explanation for the above.

awk '                    ##Starting awk program from here.
BEGIN{                   ##Starting BEGIN section from here.
  FS=";"                 ##Setting FS as ; here.
  OFS=","                ##Setting OFS as , here.
}
FNR==NR{                 ##Checking condition if FNR==NR which will be TRUE when Input_file2 is being read.
  a[$1]=$10              ##Creating array a with index $1 and value is $10 here.
  next                   ##next will skip all further statements from here.
}
($1 in a) && $9=="U"{    ##Checking if $1 is in a and 9th field is U then do following.
  print $0,a[$1]         ##Printing current line along with value of a with index of $1 here.
}
' file2 FS="," file1     ##Mentioning Input_file2 then setting FS as , and mentioning Input_file1 here.

Upvotes: 2
