Reputation: 3155
We have a csv:
targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag
51, cpd-7788990 ,9999,0, IC50 ,,10, uM , 2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM , 2006-08-01 00:00:00 , Enzyme ,
51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
Our final goal is to get : if the “cpd_number” ($2) is the same but “discipline” ($10) is not “Cell”, merge the lines “discipline” ($10) is “Cell” and not the “Cell” together. (The “discipline” has only 3 options: Biochemical, Cell, Enzyme.) The following is the ideal output.
(Note) The new “result_value” ($7) = “result_value” ($7) of the line where “discipline” ($10) is “Cell” divide by “result_value” ($7) of the line where “discipline” ($10) is “Biochemical” or "Enzyme".
targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator, result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
It looks quite complicated to do it at one time. Thus, I am trying to merge the whole line first: if the “cpd_number” ($2) is the same but the “discipline” ($10) are “different”, merge the lines “discipline” ($10) are “Cell” together with lines “discipline” ($10) are not the “Cell”. After this merge, we can use awk for the further clean up/re-name the header. Could any guru kindly offer some ideas how to write this one-liner? This is just a toy example. The actual csv files are quite huge so starting with /^51/ might not be ideal. Thanks!
targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag, targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag
51, cpd-7788990 ,9999,0, IC50 ,,10, uM , 2006-07-01 00:00:00 , Biochemical , 51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM , 2006-08-01 00:00:00 , Enzyme , 51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
Extra example:
targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator, result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline
51, cpd-7788990 ,9999,0, IC50 ,,10, uM , 2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM , 2006-08-01 00:00:00 , Enzyme ,
51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
51, cpd-7788990 ,8888,9999, IC50 ,,200, uM , 2006-09-01 00:00:00 , Cell ,
Output:
targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator, result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
51,cpd-7788990,8888,9999, IC50 ,9999,0,IC50,,20,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,8888,9999, IC50 ,4444,5555,Ki,>,40,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
Upvotes: 0
Views: 138
Reputation: 77185
This is a an awk script hacked together based on your sample input and final desired output. Feel free to tweak it around to suit your needs. It should be good enough to get you started. It takes two passes to your csv file. In the first pass, it builds an array based on 2nd column with discipline as cell and in second pass it formats the lines together. Since you have not stated what to do with lines that do not have a Cell discipline, the following solution ignores them.
Content of script.awk
BEGIN {
FS = " *, *" # Set input field sep to this regex
OFS = "," # Set output field sep to comma
}
NR==FNR { # In the first pass to the file
if ($(NF-1) == "Cell") { # If the second last field is Cell
flds[$2,$3,$4] = $3 OFS $4 OFS $5; # Create an array to store col 3,4 and 5 separated by comma
date[$2,$3,$4] = $9 # Store date
result[$2,$3,$4] = $7 # Store col 7
}
next # Move to the next record
}
{ # For the second pass to the file
for (fld in flds) { # For every entry in our array
split (fld, tmp, SUBSEP); # Split the composite key
if ($(NF-1) != "Cell" && tmp[1] == $2) { # If the last field is not Cell and first piece of key is same as col 2
line = $0 # Save the current line in a variable
$3 = flds[fld] OFS $3 # modify col3 to put the value from array in front of col3
$7 = result[fld] / $7 # Calculate the new result value
$9 = date[fld] OFS $9 # Add the date
$(NF-1) = "Cell" OFS $(NF-1) # Place the Cell text
NF-- # Remove the last field
print # print the line
$0 = line # Swap the modified line back
}
}
}
$(NF-1) == "Cell" { # If the last field is Cell don't print it
next
}
Run it like:
$ awk -f script.awk file file
51,cpd-7788990,1212,2323,IC50,9999,0,IC50,,10,uM,2006-09-01 00:00:00,2006-07-01 00:00:00,Cell,Biochemical
51,cpd-7788990,8888,9999,IC50,9999,0,IC50,,20,uM,2006-09-01 00:00:00,2006-07-01 00:00:00,Cell,Biochemical
51,cpd-7788990,1212,2323,IC50,4444,5555,Ki,>,20,uM,2006-09-01 00:00:00,2006-08-01 00:00:00,Cell,Enzyme
51,cpd-7788990,8888,9999,IC50,4444,5555,Ki,>,40,uM,2006-09-01 00:00:00,2006-08-01 00:00:00,Cell,Enzyme
You can include the printing of header statement inside the BEGIN
block.
Upvotes: 2