Reputation: 3145
If we have three example input files:
Test_95_target_1334_assay_Detail3.csv
A,accession,result_id,cpd_number,lot_no,assay_id,alt_assay_id,version_no,result_type,type_desc,operator,result_value,unit_id,unit_value,unit_desc,batch_no,experiment_date,discipine,assay_name,activity_flag
95,PKC,123456,cpd-0123456,1,1334,5678,1,1,IC50,>,26.21,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Enzymatic,PBA,
95,PKC,123456,cpd-0123456,1,1334,4600,1,1,IC50,,17.1,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Enzymatic,PBA,
95,PKC,123456,cpd-1234567,1,1334,2995,1,1,Ki,,30,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Enzymatic,PBA,
95,PKC,123456,cpd-1234567,1,1334,2900,1,1,IC50,,30,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Enzymatic,PBA,
Test_95_target_1338_assay_Detail3.csv
A,accession,result_id,cpd_number,lot_no,assay_id,alt_assay_id,version_no,result_type,type_desc,operator,result_value,unit_id,unit_value,unit_desc,batch_no,experiment_date,discipine,assay_name,activity_flag
95,PKC,123456,cpd-0123456,1,1338,3999,1,1,IC50,,55,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Biochemical,PBA,
95,PKC,123456,cpd-0123456,1,1338,1985,1,1,IC50,,66,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Biochemical,PBA,
95,PKC,123456,cpd-1234007,1,1338,2995,1,1,Ki,,18,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Biochemical,PBA,
95,PKC,123456,cpd-1239867,1,1338,2900,1,1,IC50,,20,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Biochemical,PBA,
95,PKC,123456,cpd-1234567,1,1338,2900,1,1,IC50,,20,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Biochemical,PBA,
Test_95_target_2888_assay_Detail3.csv
Test,accession,result_id,cpd_number,lot_no,assay_id,alt_assay_id,version_no,result_type,type_desc,operator,result_value,unit_id,unit_value,unit_desc,batch_no,experiment_date,discipine,assay_name,activity_flag
95,PKC,123456,cpd-0123456,1,2888,3830,1,1,IC50,>,24.49,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Cell,PBA,
95,PKC,123456,cpd-0123456,1,2888,4600,1,1,IC50,,19.6799,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Cell,PBA,
95,PKC,123456,cpd-1234567,1,2888,3830,1,1,IC50,,30,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Cell,PBA,
95,PKC,123456,cpd-5566778,1,2888,3830,1,1,IC50,,30,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Cell,PBA,
Is there any way with bash/awk (Python is also welcome!) to categorize the files by whether the 18th column ($18) is "Enzymatic", "Biochemical", or "Cell"? The goal is to choose the file with the largest number of unique compounds ($4) among the files where $18 is "Enzymatic" or "Biochemical", and, separately, the file with the largest number of unique compounds among the files where $18 is "Cell".
In this case we would choose "Test_95_target_1338_assay_Detail3.csv" from the files where $18 is "Enzymatic" or "Biochemical", because it has more unique compounds in $4 than "Test_95_target_1334_assay_Detail3.csv" (which only has 2). And we would choose "Test_95_target_2888_assay_Detail3.csv" for the Cell category, because it is the only file there.
My attempt is below: this script finds the CSV file with the most lines and stores that file name in a variable for the following process. I have another script that finds the CSV file with the most unique compounds ($4), but I left it on another laptop and will only be able to access it tomorrow morning, so I am just posting this one.
#!/bin/bash
for A in 95
do
    wc -l Test_${A}_target_*_assay_Detail3.csv > Test_${A}_target.csv
    ### This produces:
    # 4 Test_95_target_1334_assay_Detail3.csv
    # 4 Test_95_target_1338_assay_Detail3.csv
    # 4 Test_95_target_2888_assay_Detail3.csv
    # 12 total
    head -n -1 Test_${A}_target.csv > Test_${A}_target2.csv        # remove the last "total" line
    sort -k1 -r -n Test_${A}_target2.csv > Test_${A}_target3.csv   # sort by the count column, descending
    awk '{print $2}' Test_${A}_target3.csv > Test_${A}_target4.csv # keep only the file-name column of the "wc -l" output
    max=$(head -n 1 Test_${A}_target4.csv)  # top file name becomes the variable "max" for the following process
    echo "$max"
    rm Test_${A}_target.csv Test_${A}_target2.csv Test_${A}_target3.csv Test_${A}_target4.csv
done
Output:
echo $max
Test_95_target_1338_assay_Detail3.csv
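The unique-compound script I left on the other laptop does roughly the following (this is a sketch from memory, not the real script; the function name most_unique_detail3_file is just a placeholder):

```shell
# Sketch: for each Detail3 CSV, count distinct values in column 4
# (skipping the header line) and keep the file with the highest count.
most_unique_detail3_file() {
    local f count max_count=0 max_file=""
    for f in Test_95_target_*_assay_Detail3.csv; do
        [ -e "$f" ] || continue   # no matching files in this directory
        count=$(awk -F, 'NR>1 && !seen[$4]++ {n++} END {print n+0}' "$f")
        if [ "$count" -gt "$max_count" ]; then
            max_count=$count
            max_file=$f
        fi
    done
    echo "$max_file"
}
```

It would be used like max=$(most_unique_detail3_file) in place of the wc -l step above.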
However, I couldn't quite figure out how to categorize the CSV files based on the $18 info. Could any gurus kindly offer some comments or solutions? Thanks.
Upvotes: 0
Views: 150
Reputation: 207465
Updated Answer
In the light of your comments, I have had a try at reworking the awk to match your new needs. I would probably encode them something like this:
#!/bin/bash
# Do enzymatic/biochemical first (note: IGNORECASE needs GNU awk)
for f in *.csv; do
    awk -F, -v IGNORECASE=1 'NR>1 && ($18 ~ "enzymatic" || $18 ~ "biochemical") && $12<=10 {print $12,FILENAME}' "$f"
done | sort -n | tail -3

# Now do cell types
for f in *.csv; do
    awk -F, -v IGNORECASE=1 'NR>1 && $18 ~ "cell" && $12<=10 {print $12,FILENAME}' "$f"
done | sort -n | tail -3
However, I think the following may be more efficient and easier (grep -E replaces the obsolescent egrep):
grep -EHi "enzyme|biochemical" *.csv | awk -F, '$12<=10{split($1,a,":");filename=a[1];print filename,$12}' | sort -n | tail -3
grep -Hi "cell" *.csv | awk -F, '$12<=10{split($1,a,":");filename=a[1];print filename,$12}' | sort -n | tail -3
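If the actual goal is, per category, to pick the file with the most unique compounds in field 4, that could be sketched in a single awk pass over all the files (assumptions: comma-separated files with a one-line header, and the function name best_per_category is my own):

```shell
# Sketch: count unique field-4 compounds per (category, file) pair,
# then report the file with the highest count for each category.
best_per_category() {
    awk -F, '
        FNR > 1 {
            d = tolower($18)
            if (d == "enzymatic" || d == "biochemical") cat = "enz/bio"
            else if (d == "cell")                       cat = "cell"
            else next
            if (!seen[cat, FILENAME, $4]++) count[cat, FILENAME]++
        }
        END {
            for (k in count) {
                split(k, p, SUBSEP)    # p[1] = category, p[2] = file name
                if (count[k] > best[p[1]]) { best[p[1]] = count[k]; file[p[1]] = p[2] }
            }
            for (c in best) print c, file[c], best[c]
        }' "$@"
}
```

Run as best_per_category *.csv; with your sample files it should report the 1338 file for enzymatic/biochemical and the 2888 file for cell.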
Original Answer
I think this is what you mean!
#!/bin/bash
for f in *.csv; do
    res=$(awk -F',' '
        BEGIN {IGNORECASE=1; field18ok=0}
        $18 ~ "enzymatic" || $18 ~ "biochemical" || $18 ~ "cell" {field18ok=1}
        NR>1  {if(!col4[$4]++) u++}
        END   {print field18ok * u}' "$f")
    echo "$res:$f"
done | sort -n
It cycles through all the .csv files, passing each one into awk.
If any line has one of your 3 keywords (upper or lower case) in field 18, it sets a flag to say field 18 is ok and is one of the ones you are looking for. If field 18 is not one of the ones you are looking for, the variable field18ok will stay at zero and will make the answer printed at the end equal to zero.
The next part, starting NR>1, applies only to lines where the line number is greater than one, so it basically ignores the header line of the input file. It counts unique values in column 4 by remembering all the values it has already seen in column 4 in an array called col4[]. The first time a value is added to this array, u (the number of unique things seen in field 4) is incremented.
At the end (END{}), it multiplies field18ok by the number of unique compounds in column 4. So, if field 18 is not one that you want, the answer will be zero, whereas if field 18 is one of the values you are looking for, it will be the number of unique values in field 4.
The output is then sorted numerically so you can pick the highest value easily.
Upvotes: 2
Reputation: 5193
This code reads every row of each file, takes the 18th field (index 17, because indexing is zero-based), and, if the value matches the condition, adds the compound to a dict keyed by file name.
I've used a set, because this structure doesn't store duplicate values.
Finally, you only have to check each set's size to know which file has the most unique values.
import csv

files = ['Test_95_target_1334_assay_Detail3.csv',
         'Test_95_target_1338_assay_Detail3.csv',
         'Test_95_target_2888_assay_Detail3.csv']
pos_to_check = 17  # zero-based index of the discipline column
pos_compound = 3   # zero-based index of the compound column
values_to_check = ["enzymatic", "biochemical", "cell"]

result = dict((file, set()) for file in files)  # file : set of compounds

for file in files:
    with open(file, 'r') as csvfile:
        csvreader = csv.reader(csvfile)
        for row in csvreader:
            if row[pos_to_check].lower() in values_to_check:
                result[file].add(row[pos_compound])

# get the key (file) which has the most elements
print(max(result, key=lambda key: len(result[key])))
Upvotes: 2