Anil
Anil

Reputation: 420

how to collapse fields with zeros from a datafile

I have a file with data as follows

col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
1,2,3,4,5,6,7,8,9,10
1,2,1,2,0,1,0,1,0,1
1,1,1,1,0,2,3,0,0,0
5,1,1,0,0,0,0,0,1,0 

I would like to change the delimiters from col6 through column 10 to pipe '|' and the column value would be followed by column name.

Desired Output:

1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1

I tried using the command

awk -F ', *' 'NR==1{for (i=1; i<=NF; i++) hdr[i]=$i; next}
   {for (i=1; i<=NF; i++) if ($i>0) printf "%s%s", ((i>5)?hdr[i] ":":"") $i,
            ((i<NF)? ((i>5)?"|":",") : ORS)}' data.csv

but not getting the result as expected

Output:

1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,col6:1|col8:1|col10:1
1,1,1,1,col6:2|col7:3|5,1,1,col9:1|

The columns that does not contain a zero is ending with '|' and the next line is starting data starts there ! In this example, row 2 data ends with a pipe '|' and row 3 data starts in the row 2. row 4 data ends with a pipe '|'

Can some one help me fix this please

P.S: For people looking for the reason behind all this work, I'm trying to load the data from a csv file in to a framework. Source data has 10 columns and The destination dataset would have 6 columns - first 5 from source as is and the rest as a map. Also, I'll have to make sure that there is no map key with the value as zero and then start the data analysis on the set. This post is to get help for making the data set ready for analysis.

Upvotes: 2

Views: 75

Answers (2)

Fred
Fred

Reputation: 579

The above solution is excellent. Helps me with a similar issue. However, I need to cater for an all-zero case in columns 6 to 10. See the last line of your data below.

col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
1,2,3,4,5,6,7,8,9,10
1,2,1,2,0,1,0,1,0,1
1,1,1,1,0,2,3,0,0,0
5,1,1,0,0,0,0,0,1,0 
5,1,1,0,0,0,0,0,0,0 

This might never happen in your data, however if it does you are left with an inconvenient comma at the end of the line:

1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1
5,1,1,0,0,

To get around it I made change. Here it is, somewhat spread out for clarity:

awk -F ', *' '
NR==1{
    for (i=1; i<=NF; i++) hdr[i]=$i":"
    next
}

{
    for (i=1; i<5; i++) printf("%s,", $i);
    if(i==5) printf("%s", $i);
    b="";
    for (i=6; i<=NF; i++) {
        if ($i>0) {
            if(b=="") b=","; else b="|";
            printf("%s%s",b, hdr[i] $i);
        }
    }
    printf(ORS);
}

Upvotes: 1

John1024
John1024

Reputation: 113994

$ awk -F ', *' 'NR==1{for (i=1; i<=NF; i++) hdr[i]=$i":"; next} {for (i=1; i<=5; i++) printf $i","; b=""; for (i=6; i<=NF; i++) if ($i>0) {printf "%s%s", b, hdr[i] $i; b="|";} printf ORS}' data.csv
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1

Or, written over multiple lines:

awk -F ', *' '
NR==1{
    for (i=1; i<=NF; i++) hdr[i]=$i":"
    next
}

{
    for (i=1; i<=5; i++) printf $i","
    b=""
    for (i=6; i<=NF; i++) if ($i>0) {printf "%s%s", b, hdr[i] $i; b="|";}
    printf ORS
}
' data.csv

How it works

  • NR==1{for (i=1; i<=NF; i++) hdr[i]=$i":"; next}

    For the first line, NR==1, we save each field and a trailing colon into array hdr. Then, the rest of the commands are skipped and we just to the next line.

  • for (i=1; i<=5; i++) printf $i","

    If we get here, we are working or the second are later lines. In this case, we print the first five fields, each followed by a comma.

  • b=""

    We initialize the variable b to the empty string.

  • for (i=6; i<=NF; i++) if ($i>0) {printf "%s%s", b, hdr[i] $i; b="|";}

    For fields 6 to the end, if the field is nonzero, we print b followed by the hdr followed by the field value. After we have encountered the first such nonzero field, b is set to |.

  • printf ORS

    After printing the last field, we print an output record separator (default is a newline).

Upvotes: 3

Related Questions