Farrah Khan

Reputation: 31

How to include grep searches that return blank values in my output?

I have a tab-delimited file named no-dots.txt with approximately 120 000 entries. It looks something like this:

cluster0   E:1.2e-12^RecName: Full=Putative tyrosine phosphatase 123R;^Viruses
cluster1   E:1.2e-12^RecName: Full=Putative tyrosine phosphatase 123R;^Viruses
cluster2   E:1.2e-12^RecName: Full=Putative tyrosine phosphatase 123R;^Viruses

I have the following script so far:

readarray -t clusternames_array < clusternames.txt

for name in "${clusternames_array[@]}"
do
    grep -w "$name" no-dots.txt | awk -F "\t" '{print $2}' | awk -F '=' '{print $2}' | awk -F ";" '{print $1}' | sed 's/{[^{}]*}//g' | sort | uniq -c | sort -k 1,1nr | head -n 1 | cut -b 5-8
done

I am grep-ping each cluster (cluster0, cluster1, cluster2, ... cluster120000) from the file and trying to extract information in the second column.

The next three awk steps and the sed step simply reduce the

E:1.2e-12^RecName: Full=Putative tyrosine phosphatase 123R;^Viruses

to something along the lines of

Putative tyrosine phosphatase 123R

this step is fine for my purposes.

sort | uniq -c | sort simply counts how often each distinct name occurs in the cluster and sorts the names by that count in descending order.

head -n 1 is for me to continue with only the name with the highest occurrence.
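To make the counting step concrete, here is a minimal sketch on made-up input. The function name most_common is only for illustration and is not part of my script:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read names on stdin, print the most frequent one
# together with its count (the count comes first, padded with spaces).
most_common() {
    sort | uniq -c | sort -k 1,1nr | head -n 1
}

# B occurs three times, A twice, C once, so the line for B is printed.
printf '%s\n' B A B C B A | most_common
```

The amount of padding in front of the count depends on the uniq implementation.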

The output of this is usually something along the lines of

     7    Putative tyrosine phosphatase 123R

Because of this formatting I simply use

cut -b 5-8 to extract the number of occurrences, and cut -b 5-8 --complement to extract the name of the most frequently occurring entry.
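Since cutting at fixed byte positions depends on how wide the count column happens to be, a position-independent split might be more robust. This is only a sketch of an alternative, not what my script currently does; count_of and name_of are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Hypothetical helpers that split one "uniq -c" line read from stdin.
count_of() {
    # The count is the first whitespace-separated field.
    awk '{print $1}'
}
name_of() {
    # Strip the leading padding, the count and the blanks after it.
    sed -E 's/^[[:space:]]*[0-9]+[[:space:]]+//'
}

line='      7 Putative tyrosine phosphatase 123R'
printf '%s\n' "$line" | count_of   # prints 7
printf '%s\n' "$line" | name_of    # prints Putative tyrosine phosphatase 123R
```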

I run this in a for loop in order to have a list of 120 000 numbers/names that I can simply paste into an Excel file. Ultimately I would like to have an entry for EVERY cluster even if grep does not find anything. However, if the output of this code is nothing (an empty string as far as I understand), it is not written to the list generated. The file I end up with is always much shorter.

How do I change my script to include lines that had no value so that I end up with an output file with the total 120 000 entries?

As an example I get a file like this:

name0
name1
name3
name4
name6
name7
name9

where name2, name5, name8 etc. are omitted, but I want to add a placeholder to maintain the position of each output:

name0
name1
NULL
name3
name4
NULL
name6
name7
NULL
name9

Upvotes: 1

Views: 947

Answers (2)

xhienne

Reputation: 6134

Append this to your loooong chain of commands:

| grep . || echo NULL

Explanation:

  • The command1 || command2 construct executes command2 only when command1 fails, i.e. when it returns a nonzero exit status.

  • By default, the exit status of a pipeline (= command1 | command2 | ... | commandN), i.e. the value it returns to the shell, is the exit status of its last command.

  • If your last command (cut -b 5-8 here) doesn't output anything, then the grep . we have just added will fail and return 1 (failure). Consequently, the whole chain of piped commands will be considered by the shell to have failed and, due to the || operator, the shell will execute the command echo NULL.

  • If your last command (cut -b 5-8) outputs anything, then the output will be unchanged: grep . will act as a no-op and return 0 (success) since it has found something. Consequently the whole chain of piped commands will be considered by the shell to have succeeded and, due to the || operator, echo NULL won't be executed.
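Here is a minimal sketch of the idea inside a loop, run on made-up sample data rather than your real file. The function name top_name_or_null is hypothetical, and for brevity it strips the count with sed instead of your cut -b step:

```shell
#!/usr/bin/env bash
# Made-up sample data standing in for no-dots.txt (tab-delimited).
data=$(mktemp)
printf 'cluster0\tE:1.2e-12^RecName: Full=Putative tyrosine phosphatase 123R;^Viruses\n' > "$data"

# Hypothetical helper: $1 is the data file, $2 the cluster name.
# Prints the most frequent name, or NULL when the pipeline outputs nothing.
top_name_or_null() {
    grep -w "$2" "$1" | awk -F '\t' '{print $2}' | awk -F '=' '{print $2}' |
        awk -F ';' '{print $1}' | sort | uniq -c | sort -k 1,1nr | head -n 1 |
        sed -E 's/^[[:space:]]*[0-9]+[[:space:]]+//' |
        grep . || echo NULL
}

# cluster0 is in the sample data, cluster1 is not.
for name in cluster0 cluster1; do
    top_name_or_null "$data" "$name"
done
rm -f "$data"
```

This prints the phosphatase name for cluster0 and NULL for cluster1, so every cluster gets exactly one output line. Note that it relies on the default pipeline exit status; with set -o pipefail the status of the whole chain would no longer come from grep . alone.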

Upvotes: 2

karakfa

Reputation: 67467

something like this...

$ awk -F'\t|=|;' '{print $1 "\t" $3}' no-dots.txt |
  sort | uniq -c | sort -k2,2 -k1,1nr        |
  awk '!a[$2]++ {sub(/^ *[0-9]+ /, ""); print}' |
  awk -F'\t' 'NR==FNR{a[$1]=$2; next} {print (($1 in a) ? a[$1] : "NULL")}' - clusternames.txt

unfortunately you don't have testable input data so not tested.
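The final lookup step can at least be checked in isolation on made-up data. This is a sketch, assuming the best name per cluster is already available as tab-separated "cluster, name" lines; lookup_or_null and the file names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: $1 is a tab-separated "key<TAB>value" file,
# $2 a file with one key per line. Prints one line per key, in key order.
lookup_or_null() {
    awk -F'\t' '
        NR == FNR { a[$1] = $2; next }              # first file: fill the table
        { print (($1 in a) ? a[$1] : "NULL") }      # second file: look each key up
    ' "$1" "$2"
}

tmp=$(mktemp -d)
printf 'cluster0\tPutative tyrosine phosphatase 123R\ncluster2\tSome other protein\n' > "$tmp/best.tsv"
printf 'cluster0\ncluster1\ncluster2\n' > "$tmp/clusternames.txt"

# cluster1 has no entry in best.tsv, so its line is NULL.
lookup_or_null "$tmp/best.tsv" "$tmp/clusternames.txt"
rm -r "$tmp"
```

One caveat: the NR==FNR test assumes the first input is not empty, otherwise the key file would be read into the table instead.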

Upvotes: 1
