Reputation: 31
I have a tab-delimited file named no-dots.txt, with approximately 120 000 entries, that looks something like this:
cluster0 E:1.2e-12^RecName: Full=Putative tyrosine phosphatase 123R;^Viruses
cluster1 E:1.2e-12^RecName: Full=Putative tyrosine phosphatase 123R;^Viruses
cluster2 E:1.2e-12^RecName: Full=Putative tyrosine phosphatase 123R;^Viruses
I have the following script so far:
readarray -t clusternames_array < clusternames.txt
for name in "${clusternames_array[@]}"
do
grep -w "$name" no-dots.txt | awk -F "\t" '{print $2}' | awk -F '=' '{print $2}' | awk -F ";" '{print $1}' | sed 's/{[^{}]*}//g' | sort | uniq -c | sort -k 1,1nr | head -n 1 | cut -b 5-8
done
I am grepping for each cluster (cluster0, cluster1, cluster2, ... cluster120000) in the file and trying to extract information from the second column.
The next three awk steps and the sed step simply reduce
E:1.2e-12^RecName: Full=Putative tyrosine phosphatase 123R;^Viruses
to something along the lines of
Putative tyrosine phosphatase 123R
This part works fine for my purposes.
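As an aside, I realize the three awk calls and the sed call could probably be collapsed into a single awk invocation; a rough sketch, assuming a single "=" right before the name and a ";" right after it, and run here on the whole file rather than on grep's output:

awk -F '\t' '{
    s = $2                        # annotation column of no-dots.txt
    sub(/^[^=]*=/, "", s)         # keep everything after the "="
    sub(/;.*$/,    "", s)         # drop everything from the first ";" onward
    gsub(/[{][^{}]*[}]/, "", s)   # strip any {...} groups, as the sed step does
    print s
}' no-dots.txt

but that is beside the point of this question.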
sort | uniq -c | sort is simply there to count the occurrences of each unique name within the cluster and sort them in descending order of count, and head -n 1 is so that I continue with only the most frequently occurring name.
The output of this is usually something along the lines of
7 Putative tyrosine phosphatase 123R
Because of this formatting I simply use cut -b 5-8 to extract the number of occurrences, and cut -b 5-8 --complement to extract the name of the most frequently occurring entry.
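If the byte positions ever shift, a position-independent way to split such a line into count and name might look something like this sketch (shown on a made-up uniq -c style line), but the cut calls work for my data:

line='      7 Putative tyrosine phosphatase 123R'
count=$(awk '{print $1}' <<< "$line")                          # occurrence count
name=$(sed 's/^[[:space:]]*[0-9]*[[:space:]]*//' <<< "$line")  # name only
printf '%s\n%s\n' "$count" "$name"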
I run this in a for loop in order to get a list of 120 000 numbers/names that I can simply paste into an Excel file. Ultimately I would like to have an entry for EVERY cluster, even if grep does not find anything. However, if the output of this pipeline is empty (an empty string, as far as I understand), nothing is written to the list, so the file I end up with is always much shorter.
How do I change my script to include the lines that had no value, so that I end up with an output file containing all 120 000 entries?
As an example I get a file like this:
name0
name1
name3
name4
name6
name7
name9
where name2, name5, name8 etc. are omitted, but I want to add a placeholder to maintain the position of each output:
name0
name1
NULL
name3
name4
NULL
name6
name7
NULL
name9
Upvotes: 1
Views: 947
Reputation: 6134
Append this to your loooong chain of commands:
| grep . || echo NULL
Explanation:
The command1 || command2 construct executes command2 only when command1 fails, i.e. when it returns something other than 0.
By default, the exit status of a pipeline (command1 | command2 | ... | commandN), which is the value returned to the shell, is the exit status of its last command.
If your last command (cut -b 5-8 here) doesn't output anything, then the grep . we have just added will fail and return 1 (failure). Consequently, the whole chain of piped commands will be considered by the shell to have failed and, due to the || operator, the shell will execute the command echo NULL.
If your last command (cut -b 5-8) outputs anything, that output passes through unchanged: grep . acts as a no-op and returns 0 (success) since it has found something. Consequently, the whole chain of piped commands will be considered by the shell to have succeeded and, due to the || operator, echo NULL won't be executed.
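Applied to the loop from your question, that would look something like this:

readarray -t clusternames_array < clusternames.txt
for name in "${clusternames_array[@]}"
do
    grep -w "$name" no-dots.txt | awk -F "\t" '{print $2}' | awk -F '=' '{print $2}' |
        awk -F ";" '{print $1}' | sed 's/{[^{}]*}//g' | sort | uniq -c |
        sort -k 1,1nr | head -n 1 | cut -b 5-8 | grep . || echo NULL
done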
Upvotes: 2
Reputation: 67467
something like this...
$ awk -F'\t|=|;' '{print $1,$3}' no-dots.txt |
sort | uniq -c | sort -k2,2 -k1,1nr |
awk '!a[$2]++ {name=$3; for (i=4; i<=NF; i++) name = name " " $i; print $2 "\t" name}' |
awk -F'\t' 'NR==FNR{a[$1]=$2; next} {print ($1 in a) ? a[$1] : "NULL"}' - clusternames.txt
unfortunately you didn't post testable input data, so this is not tested.
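The last awk line is what guarantees one output line per cluster. As a standalone, hedged illustration of that NR==FNR lookup pattern, with a made-up two-line mapping piped in on stdin (the "-" file is read first and loaded into the array a, then every name in clusternames.txt is printed with its mapped value or NULL):

printf 'cluster0\tnameA\ncluster1\tnameB\n' |
awk -F'\t' 'NR==FNR{a[$1]=$2; next} {print ($1 in a) ? a[$1] : "NULL"}' - clusternames.txt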
Upvotes: 1