Reputation: 18885
I am trying to parse a large CSV (about 90 MB) that has a header section (130 rows, single column). Line 131 is the column header that has names for individual column (20 columns in total).
I don't need to parse the first 9 columns, the data beginning from column #10 through colum#20 are in the following format (reformatted as column for easier view, but each row below is actually in an individual column in the CSV)
0/0:3,0:9:0,9,90
./.
0/1:6,3:49:73,0,111
0/1:13,3:99:103,0,254
0/1:6,6:45:56,3,124
0/1:2,2:34:43,0,51
0/1:80,0:87:252,5,1882
0/1:25,12:99:139,286,3281
./.
0/1:6,8:64:64,0,130
0/1:4,5:65:109,0,114
Take this "0/0:3,0:9:0,9,90" as an example, the criteria is: if the two numbers (3,0) as highlighted add up to >=20 AND the second number (0) is greater or equal to 10, then print it out to new CSV. If does not pass the criteria, just print "./." in the new CSV. The cells that have "./." already will remain "./." in the new CSV.
I have asked a part of the question in this thread use awk to process a csv (tab delimited) line by line and, thanks to Kent, I now know how to apply the criteria to a single cell, but I need to loop through 10 columns and apply the same criteria. Below is what I have come up with, but apparently, I need more help with finishing this. Any help/suggestion will be appreciated.
BEGIN {
-F'\t'; -v OFS='\t'
for (i=10; i<=20; i++)
{
t=$i;
split(t,a,/[:,]/)}(a[2]+a[3])>=20 && a[3]>=10
}
}
Upvotes: 0
Views: 1314
Reputation: 37298
It would help provide an exact answer if your question included example output from the given input. But overall, you're very close to your solution ... except ... (t,a,/[:,]/)}
? where's the opening '{' char? You have an imbalance of {
and }
s in your code. You should be getting an error message and depending on which awk, it will most likely point you to what it can't parse.
BEGIN { -F"\t"; OFS="\t" }
# -----------missing ----^--
# awk doesn't like '\t' quoting, use "\t"
# the -v is only for the command line, not in a BEGIN block
# awk executes whatever is inside a { } block like illustrated below
# for each line of data in the file
#
# not clear if this is what you want
{
# --- note, missing opening { char
for (i=10; i<=20; i++) {
t=$i;
#old ver split(t,a,/[:,]/)}(a[2]+a[3])>=20 && a[3]>=10
# new
split(t,a,/[:,]/)
if ( (a[2]+a[3])>=20 && a[3]>=10 ) {
printf("%s\t", $i)
} # end if
} # end for i
printf("\n")
} # end awk loop
# that processes each line
# of data from a a file
To print just part of a line, as I understand your question, you'll have to rely on awk
more granular printf("%s", "string")
functionality. Note that I'm passing $i
as the parameter to print, you may need to use one of your array variables a[2]
or whatever.
IHTH
Upvotes: 2