Chubaka

Reputation: 3145

Categorize csv files based on $18 and, in each category, find the csv file with the largest number of unique values in $4

If we have three example input files:

Test_95_target_1334_assay_Detail3.csv

A,accession,result_id,cpd_number,lot_no,assay_id,alt_assay_id,version_no,result_type,type_desc,operator,result_value,unit_id,unit_value,unit_desc,batch_no,experiment_date,discipine,assay_name,activity_flag
95,PKC,123456,cpd-0123456,1,1334,5678,1,1,IC50,>,26.21,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Enzymatic,PBA,
95,PKC,123456,cpd-0123456,1,1334,4600,1,1,IC50,,17.1,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Enzymatic,PBA,
95,PKC,123456,cpd-1234567,1,1334,2995,1,1,Ki,,30,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Enzymatic,PBA,
95,PKC,123456,cpd-1234567,1,1334,2900,1,1,IC50,,30,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Enzymatic,PBA,

Test_95_target_1338_assay_Detail3.csv

A,accession,result_id,cpd_number,lot_no,assay_id,alt_assay_id,version_no,result_type,type_desc,operator,result_value,unit_id,unit_value,unit_desc,batch_no,experiment_date,discipine,assay_name,activity_flag
95,PKC,123456,cpd-0123456,1,1338,3999,1,1,IC50,,55,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Biochemical,PBA,
95,PKC,123456,cpd-0123456,1,1338,1985,1,1,IC50,,66,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Biochemical,PBA,
95,PKC,123456,cpd-1234007,1,1338,2995,1,1,Ki,,18,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Biochemical,PBA,
95,PKC,123456,cpd-1239867,1,1338,2900,1,1,IC50,,20,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Biochemical,PBA,
95,PKC,123456,cpd-1234567,1,1338,2900,1,1,IC50,,20,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Biochemical,PBA,

Test_95_target_2888_assay_Detail3.csv

Test,accession,result_id,cpd_number,lot_no,assay_id,alt_assay_id,version_no,result_type,type_desc,operator,result_value,unit_id,unit_value,unit_desc,batch_no,experiment_date,discipine,assay_name,activity_flag
95,PKC,123456,cpd-0123456,1,2888,3830,1,1,IC50,>,24.49,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Cell,PBA,
95,PKC,123456,cpd-0123456,1,2888,4600,1,1,IC50,,19.6799,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Cell,PBA,
95,PKC,123456,cpd-1234567,1,2888,3830,1,1,IC50,,30,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Cell,PBA,
95,PKC,123456,cpd-5566778,1,2888,3830,1,1,IC50,,30,1,uM,micromolar,67682,1/24/2007 12:00:00AM,Cell,PBA,

Is there any way to use bash/awk (python is also welcome!) to categorize the files by whether the 18th column ($18) is "enzymatic", "biochemical" or "cell"? The goal is to choose the file with the largest number of unique compounds ($4) among the files whose $18 is either "enzymatic" or "biochemical", and likewise to choose the file with the largest number of unique compounds among the files whose $18 is "cell".

In this case, we will choose "Test_95_target_1338_assay_Detail3.csv" from the files where the 18th column is either "Enzymatic" or "Biochemical", because "Test_95_target_1338_assay_Detail3.csv" has 4 unique compounds in $4 while "Test_95_target_1334_assay_Detail3.csv" only has 2 unique compounds (4 > 2). And we will choose "Test_95_target_2888_assay_Detail3.csv" for the cell category, because it is the only file in that category.

Here is my attempt so far: this script finds the csv file with the most lines and stores that file name in a variable for later processing. I have another script that finds the csv file with the most unique compounds ($4), but I left it on another laptop and will only be able to access it tomorrow morning, so I am just posting this one.

#!/bin/bash

for A in 95
do
    wc -l Test_${A}_target_*_assay_Detail3.csv > Test_${A}_target.csv

    ### This produces:
    #  4 Test_95_target_1334_assay_Detail3.csv
    #  4 Test_95_target_1338_assay_Detail3.csv
    #  4 Test_95_target_2888_assay_Detail3.csv
    # 12 total

    head -n -1 Test_${A}_target.csv > Test_${A}_target2.csv    # remove the last "total" line

    sort -k1 -r -n Test_${A}_target2.csv > Test_${A}_target3.csv    # sort by the count column, descending

    awk '{print $2}' Test_${A}_target3.csv > Test_${A}_target4.csv    # keep only the file-name column ($2) of the "wc -l" output

    max=$(head -n 1 Test_${A}_target4.csv)    # store the top file name in the variable "max" for later processing

    echo "$max"

    rm Test_${A}_target3.csv Test_${A}_target2.csv Test_${A}_target.csv
done

Output:

echo $max

Test_95_target_1338_assay_Detail3.csv

However, I couldn't quite figure out how to categorize the csv files based on $18. Could any gurus kindly offer some comments or solutions? Thanks.

Upvotes: 0

Views: 150

Answers (2)

Mark Setchell

Reputation: 207465

Updated Answer

In the light of your comments, I have had a try at reworking the awk to match your new needs. I would probably encode them something like this:

#!/bin/bash

# Do enzymatic/biochemical first

for f in *.csv; do
   awk -F, -v IGNORECASE=1 'NR>1 && ($18 ~ "enzymatic" || $18 ~ "biochemical") && $12<=10 {print $12,FILENAME}' "$f"
done | sort -n | tail -3

# Now do cell types

for f in *.csv; do
   awk -F, -v IGNORECASE=1 'NR>1 && $18 ~ "cell" && $12<=10 {print $12,FILENAME}' "$f"
done | sort -n | tail -3

However, I think the following may be more efficient and easier:

egrep -Hi "enzyme|biochemical" *.csv | awk -F, '$12<=10{split($1,a,":");filename=a[1];print filename,$12}' | sort -n | tail -3

grep -Hi "cell" *.csv | awk -F, '$12<=10{split($1,a,":");filename=a[1];print filename,$12}' | sort -n | tail -3
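Since the stated goal is to pick, per category, the file with the most unique compounds in $4, a single gawk pass can do both the categorisation and the counting. The following is only a sketch, not a definitive implementation: the `demo_*.csv` files and their contents are invented here so the snippet runs standalone; with the real files you would point the awk command at `*.csv` instead.

```shell
# Invented sample files so the sketch is self-contained (stand-ins for the real data).
cat > demo_enz.csv <<'EOF'
A,accession,result_id,cpd_number,lot_no,assay_id,alt,ver,rt,td,op,rv,ui,uv,ud,bn,ed,discipine,an,af
95,PKC,1,cpd-0123456,1,1334,1,1,1,IC50,,26,1,uM,micromolar,1,1/24/2007,Enzymatic,PBA,
95,PKC,1,cpd-1234567,1,1334,1,1,1,IC50,,30,1,uM,micromolar,1,1/24/2007,Enzymatic,PBA,
EOF
cat > demo_bio.csv <<'EOF'
A,accession,result_id,cpd_number,lot_no,assay_id,alt,ver,rt,td,op,rv,ui,uv,ud,bn,ed,discipine,an,af
95,PKC,1,cpd-0123456,1,1338,1,1,1,IC50,,55,1,uM,micromolar,1,1/24/2007,Biochemical,PBA,
95,PKC,1,cpd-1234007,1,1338,1,1,1,Ki,,18,1,uM,micromolar,1,1/24/2007,Biochemical,PBA,
95,PKC,1,cpd-1239867,1,1338,1,1,1,IC50,,20,1,uM,micromolar,1,1/24/2007,Biochemical,PBA,
EOF

result=$(awk -F, '
FNR > 1 {
    c = tolower($18)
    if (c == "enzymatic" || c == "biochemical") cat = "enz/bio"
    else if (c == "cell")                       cat = "cell"
    else next
    if (!((cat, FILENAME, $4) in seen)) {       # count each compound once per file
        seen[cat, FILENAME, $4] = 1
        uniq[cat, FILENAME]++
    }
}
END {
    # For each category, keep the file with the highest unique-compound count
    for (k in uniq) {
        split(k, p, SUBSEP)
        if (uniq[k] > best[p[1]]) { best[p[1]] = uniq[k]; winner[p[1]] = p[2] }
    }
    for (cat in best) print cat ": " winner[cat] " (" best[cat] " unique compounds)"
}' demo_enz.csv demo_bio.csv)
echo "$result"

rm -f demo_enz.csv demo_bio.csv
```

With the invented data above, `demo_bio.csv` wins the enz/bio category with 3 unique compounds against 2 in `demo_enz.csv`.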

Original Answer

I think this is what you mean!

#!/bin/bash
for f in *.csv; do
   res=$(awk -F',' '
          BEGIN{IGNORECASE=1;field18ok=0} 
          $18 ~ "enzymatic" || $18 ~ "biochemical" || $18 ~ "cell" {field18ok=1}
          NR>1{if(!col4[$4]++)u++}
          END{print field18ok * u}' "$f")
   echo $res:$f
done | sort -n

It cycles through all .csv files, and passes them one at a time into awk.

If any line has one of your 3 keywords (upper or lower case) in field 18, it sets a flag to say field 18 is ok and one of the ones you are looking for. If field 18 never matches, the variable field18ok stays at zero, which makes the answer printed at the end equal to zero.

The next part, starting NR>1, applies only to lines whose line number is greater than one, so it basically ignores the header line of the input file. It then counts unique values in column 4 by remembering all the values it has already seen in an array called col4[]. The first time a value is added to that array, u (the number of unique things seen in field 4) is incremented.

At the end (END{}), it multiplies field18ok by the number of unique compounds in column 4. So, if field 18 is not one that you want, the answer will be zero, whereas if field 18 is one of the values you are looking for, it will be the number of unique values in field 4.

The output is then sorted numerically so you can pick the highest value easily.
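To then capture the winning file name in a variable (as your own script does with max), you can take the last line of the sorted output and strip the count. A stand-in for the sorted "count:filename" output is hard-coded below purely so the snippet runs on its own, with illustrative counts; in practice you would pipe the loop's output in instead:

```shell
# Stand-in for the loop's sorted "count:filename" output (illustrative values):
sorted_output='2:Test_95_target_1334_assay_Detail3.csv
4:Test_95_target_1338_assay_Detail3.csv'

# The last line has the highest count; cut off the count to keep the filename.
best=$(printf '%s\n' "$sorted_output" | tail -1 | cut -d: -f2)
echo "$best"
```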

Upvotes: 2

xecgr

Reputation: 5193

This code reads each file's data, gets the 18th field (index 17, because Python indexing is zero-based), and, if the value matches one of the target categories, adds the compound to a dict of sets keyed by filename.
I've used a set because that structure doesn't store duplicate values.
Finally, you only have to compare the sets to see which file has the most unique compounds.

import csv

files        = ['Test_95_target_1334_assay_Detail3.csv',
                'Test_95_target_1338_assay_Detail3.csv',
                'Test_95_target_2888_assay_Detail3.csv']
pos_to_check = 17  # zero-based index of the discipine column ($18)
pos_compound = 3   # zero-based index of cpd_number ($4)
values_to_check = ["enzymatic", "biochemical", "cell"]
result = {f: set() for f in files}  # file : set of compounds

for f in files:
    with open(f, 'r') as csvfile:
        csvreader = csv.reader(csvfile)
        for row in csvreader:
            if row[pos_to_check].lower() in values_to_check:
                result[f].add(row[pos_compound])

# get the key (file name) which has the most elements
print(max(result, key=lambda key: len(result[key])))

Upvotes: 2
