Asgard
Asgard

Reputation: 622

awk clear array values after reading each line of csv

The problem I have is related to a previous post. Now what I have right now is:

  1. One 9-million-rows CSV file containing fields arranged randomly like:

    192.168.12.23,62,LOCAL,341993,23/10/2012
    192.168.12.25,11,MONLOCAL$MONREMOTE,33222$56,22/10/2012$18/10/2012
    192.168.12.678,14,MONLOCAL,341993,22/10/2012
    
    192.168.12.83,18, , ,
    192.168.12.21,49,LOCAL$REMOTE,19316$15253,22/10/2012$22/10/2012
    192.168.12.79,52,REMOTE,1180306134,19/10/2012
    192.168.12.41,44,MONLOCAL$MONREMOTE,1865871$383666,22/10/2012$22/10/2012
    192.168.12.29,23,MONREMOTE,151145,18/10/2012
    

    Where as you probably already observed, there are 2 field separators, , and $:

    • field 1 = IP Address

    • field 2 = An unique ID

    • field 3 = connection that IP has used (e.g LOCAL, REMOTE, MONLOCAL, or MONREMOTE)

    • field 4 = can be also the connection IP has used (e.g LOCAL, REMOTE, MONLOCAL, or MONREMOTE) or the value related to field 3

    • field 5 = if field 4 is LOCAL or REMOTE or MONLOCAL or MONREMOTE, then field 5 represent the value of field 3 else it represent the timestamp related to field 3

    • field 6 = if field 4 is LOCAL or REMOTE or MONLOCAL or MONREMOTE, then field 6 represent the value of field 4 else it will not exist.

    • field 7 = if field 4 is LOCAL or REMOTE or MONLOCAL or MONREMOTE, then field 7 represent the timestamp of field 3 else it will not exist.

    • field 8 = if field 4 is LOCAL or REMOTE or MONLOCAL or MONREMOTE, then field 7 represent the timestamp of field 4 else it will not exist.

  2. I need to process the file above and output the following format, having the fields always in following order (10 fields):

    IP,ID,MONLOCAL_value,MONLOCAL_timestamp,LOCAL_value,LOCAL_timestamp,MONREMOTE_value,MONREMOTE_timestamp,REMOTE_value,REMOTE_timestamp like:

    192.168.12.23,62, , ,341993,23/10/2012, , , , 
    192.168.12.25,11,33222,22/10/2012, , , , ,56,18/10/2012
    192.168.12.678,14,341993,22/10/2012, , , , , ,  
    192.168.12.83,18, , , , , , , , 
    192.168.12.21,49, , ,19316,22/10/2012, , ,15253,22/10/2012
    192.168.12.79,52, , , , , , ,1180306134,19/10/2012
    192.168.12.41,44,1865871,22/10/2012, , ,383666,22/10/2012, , 
    192.168.12.29,23, , , , ,151145,18/10/2012, , 
    
  3. I have the following script to process the file from above:

    nawk 'BEGIN {
        while (getline < "'"$data"'" > 0)
        {
        {FS = "[,,$]"; OFS=","}
        split($0,flds)
           {if ($4 ~ /LOCAL|REMOTE|MONLOCAL|MONREMOTE/) {
                if ($3 ~ /MONLOCAL/) {
                        MONREMOTE_time=flds[8];
                        MONREMOTE_value=flds[6];
                        MONLOCAL_time=flds[7];
                        MONLOCAL_value=flds[5]; }
                if ($3 ~ /MONREMOTE/) {
                        MONREMOTE_time=flds[7];
                        MONREMOTE_value=flds[5];
                        REMOTE_value=flds[6];
                        REMOTE_time=flds[8]; }
                if ($3 ~ /REMOTE/) {
                        REMOTE_value=flds[5];
                        REMOTE_time=flds[7];
                        LOCAL_value=flds[6];
                        LOCAL_time=flds[8]; }
        } else {
                if($3 ~ /MONLOCAL/) {
                        MONLOCAL_value=flds[4];
                        MONLOCAL_time=flds[5]; }
                if ($3 ~ /MONREMOTE/) {
                        MONREMOTE_value=flds[4];
                        MONREMOTE_time=flds[5]; }
                if ($3 ~ /LOCAL/) {
                        LOCAL_value=flds[4];
                        LOCAL_time=flds[5]; }
                if ($3 ~ /REMOTE/) {
                        REMOTE_value=flds[4];
                        REMOTE_time=flds[5]; }
        }
      }
      {print MONLOCAL_value",MONLOCAL_time,LOCAL_value,LOCAL_time,MONREMOTE_value,MONREMOTE_time,REMOTE_value,REMOTE_time;}
     }
    }'
    
  4. The bad part here is that as you can observe below, the output is not as expected since I'm unable to clear the array values after each line is read, or find a solution to charge the array elements dynamically for each line:

    4915779000211,62, , ,341993,23/10/2012, , , , ,
    4915779000212,11,33222,22/10/2012,341993,23/10/2012,56,18/10/2012, , 
    491639000591,14,341993,22/10/2012,341993,23/10/2012, , , ,  
    491779001768,18,319307448,18/10/2012,19316,22/10/2012,383666,22/10/2012,1180306134,19/10/2012
    4915779000213,49,3849259,05/10/2012,19316,22/10/2012,56,18/10/2012,15253,22/10/2012
    491779000758,52,9356828,08/10/2012,19316,22/10/2012,56,18/10/2012,1180306134,19/10/2012
    4915779000214,44,1865871,22/10/2012,19316,22/10/2012,383666,22/10/2012,1180306134,19/10/2012
    491639000221,23,319307448,18/10/2012,19316,22/10/2012,151145,18/10/2012,1180306134,19/10/2012
    

So, if you guys have an idea how to make this awk function to work and give the expected output, I would be grateful.

Upvotes: 2

Views: 1005

Answers (1)

Chris Montanaro
Chris Montanaro

Reputation: 18202

With some modification you can use something similar to the following:

while read line; do 
  if [[ "$line" == *\$* ]]; then 
    echo $line | awk -F',|\$' '{print $1,$2,$5,$6,$3}' >> newfile;
  else
    echo $line | awk -F',' '{print $1,$2,$3,,$4,,$5}' >> newfile;
  fi
done < "/path/to/your/file"

Upvotes: 1

Related Questions