TonyW
TonyW

Reputation: 18885

AWK - loop through a large CSV file and print a new CSV containing records that pass criteria

I am trying to parse a large CSV (about 90 MB) that has a header section (130 rows, single column). Line 131 is the column header that has names for individual column (20 columns in total).

I don't need to parse the first 9 columns, the data beginning from column #10 through colum#20 are in the following format (reformatted as column for easier view, but each row below is actually in an individual column in the CSV)

0/0:3,0:9:0,9,90

./.

0/1:6,3:49:73,0,111

0/1:13,3:99:103,0,254

0/1:6,6:45:56,3,124

0/1:2,2:34:43,0,51

0/1:80,0:87:252,5,1882

0/1:25,12:99:139,286,3281

./.

0/1:6,8:64:64,0,130

0/1:4,5:65:109,0,114

Take this "0/0:3,0:9:0,9,90" as an example, the criteria is: if the two numbers (3,0) as highlighted add up to >=20 AND the second number (0) is greater or equal to 10, then print it out to new CSV. If does not pass the criteria, just print "./." in the new CSV. The cells that have "./." already will remain "./." in the new CSV.

I have asked a part of the question in this thread use awk to process a csv (tab delimited) line by line and, thanks to Kent, I now know how to apply the criteria to a single cell, but I need to loop through 10 columns and apply the same criteria. Below is what I have come up with, but apparently, I need more help with finishing this. Any help/suggestion will be appreciated.

BEGIN {
-F'\t'; -v OFS='\t'

for (i=10; i<=20; i++)   
 {
  t=$i; 
  split(t,a,/[:,]/)}(a[2]+a[3])>=20 && a[3]>=10
  }

}

Upvotes: 0

Views: 1314

Answers (1)

shellter
shellter

Reputation: 37298

It would help provide an exact answer if your question included example output from the given input. But overall, you're very close to your solution ... except ... (t,a,/[:,]/)}? where's the opening '{' char? You have an imbalance of { and }s in your code. You should be getting an error message and depending on which awk, it will most likely point you to what it can't parse.

BEGIN { -F"\t"; OFS="\t" }
# -----------missing ----^--
# awk doesn't like '\t' quoting, use "\t"
# the -v is only for the command line, not in a BEGIN block

# awk executes whatever is inside a { } block like illustrated below
# for each line of data in the file
# 
# not clear if this is what you want
{
# --- note, missing opening { char

    for (i=10; i<=20; i++) {
      t=$i; 
      #old ver split(t,a,/[:,]/)}(a[2]+a[3])>=20 && a[3]>=10
      # new
      split(t,a,/[:,]/)
       if ( (a[2]+a[3])>=20 && a[3]>=10 ) {
          printf("%s\t", $i)
       }  # end if
     }    # end for i
     printf("\n")
}         # end awk loop 
          # that processes each line
          # of data from a a file    

To print just part of a line, as I understand your question, you'll have to rely on awk more granular printf("%s", "string") functionality. Note that I'm passing $i as the parameter to print, you may need to use one of your array variables a[2] or whatever.

IHTH

Upvotes: 2

Related Questions