awk merge lines based on multiple field matching/unmatching

Question

We have a csv:

targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag 
51, cpd-7788990 ,9999,0,  IC50  ,,10, uM ,  2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM ,  2006-08-01 00:00:00  ,  Enzyme ,
51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,

Our final goal is to get : if the “cpd_number” ($2) is the same but “discipline” ($10) is not “Cell”, merge the lines “discipline” ($10) is “Cell” and not the “Cell” together. (The “discipline” has only 3 options: Biochemical, Cell, Enzyme.) The following is the ideal output.
(Note) The new “result_value” ($7) = “result_value” ($7) of the line where “discipline” ($10) is “Cell” divide by “result_value” ($7) of the line where “discipline” ($10) is “Biochemical” or "Enzyme".

targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator,   result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline 
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme

It looks quite complicated to do it at one time. Thus, I am trying to merge the whole line first: if the “cpd_number” ($2) is the same but the “discipline” ($10) are “different”, merge the lines “discipline” ($10) are “Cell” together with lines “discipline” ($10) are not the “Cell”. After this merge, we can use awk for the further clean up/re-name the header. Could any guru kindly offer some ideas how to write this one-liner? This is just a toy example. The actual csv files are quite huge so starting with /^51/ might not be ideal. Thanks!

targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag, targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag 
51, cpd-7788990 ,9999,0,  IC50  ,,10, uM ,  2006-07-01 00:00:00 , Biochemical , 51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM ,  2006-08-01 00:00:00  ,  Enzyme , 51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,

Extra example:

targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator,   result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline 
51, cpd-7788990 ,9999,0,  IC50  ,,10, uM ,  2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM ,  2006-08-01 00:00:00  ,  Enzyme ,
51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,
51, cpd-7788990 ,8888,9999,  IC50  ,,200, uM ,  2006-09-01 00:00:00  , Cell ,

Output:

targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator,   result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline 
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
51,cpd-7788990,8888,9999, IC50 ,9999,0,IC50,,20,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,8888,9999, IC50 ,4444,5555,Ki,>,40,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme

jaypal singh · Accepted Answer

This is a an awk script hacked together based on your sample input and final desired output. Feel free to tweak it around to suit your needs. It should be good enough to get you started. It takes two passes to your csv file. In the first pass, it builds an array based on 2nd column with discipline as cell and in second pass it formats the lines together. Since you have not stated what to do with lines that do not have a Cell discipline, the following solution ignores them.

Content of script.awk

BEGIN { 
    FS  = " *, *"                             # Set input field sep to this regex
    OFS = ","                                 # Set output field sep to comma
}

NR==FNR {                                     # In the first pass to the file
    if ($(NF-1) == "Cell") {                  # If the second last field is Cell
        flds[$2,$3,$4] = $3 OFS $4 OFS $5;    # Create an array to store col 3,4 and 5 separated by comma
        date[$2,$3,$4] = $9                   # Store date
        result[$2,$3,$4] = $7                 # Store col 7
    }
    next                                      # Move to the next record
} 

{                                             # For the second pass to the file
    for (fld in flds) {                       # For every entry in our array
        split (fld, tmp, SUBSEP);             # Split the composite key
        if ($(NF-1) != "Cell" && tmp[1] == $2) {   # If the last field is not Cell and first piece of key is same as col 2
            line = $0                         # Save the current line in a variable
            $3 = flds[fld] OFS $3             # modify col3 to put the value from array in front of col3
            $7 = result[fld] / $7             # Calculate the new result value
            $9 = date[fld] OFS $9             # Add the date
            $(NF-1) = "Cell" OFS $(NF-1)      # Place the Cell text
            NF--                              # Remove the last field
            print                             # print the line
            $0 = line                         # Swap the modified line back
        }
    }
}

$(NF-1) == "Cell" {                           # If the last field is Cell don't print it 
    next
}

Run it like:

$ awk -f script.awk file file
51,cpd-7788990,1212,2323,IC50,9999,0,IC50,,10,uM,2006-09-01 00:00:00,2006-07-01 00:00:00,Cell,Biochemical
51,cpd-7788990,8888,9999,IC50,9999,0,IC50,,20,uM,2006-09-01 00:00:00,2006-07-01 00:00:00,Cell,Biochemical
51,cpd-7788990,1212,2323,IC50,4444,5555,Ki,>,20,uM,2006-09-01 00:00:00,2006-08-01 00:00:00,Cell,Enzyme
51,cpd-7788990,8888,9999,IC50,4444,5555,Ki,>,40,uM,2006-09-01 00:00:00,2006-08-01 00:00:00,Cell,Enzyme

You can include the printing of header statement inside the BEGIN block.

awk merge lines based on multiple field matching/unmatching

Answers (1)

Related Questions