Reputation: 99
I trying to find a solution for the following. I have a list of gene IDs in my first column and in all the other columns the related GO terms. The number of columns behind each gene ID is therefor variable. As follows the first few lines:
TRINITY_DN173118_c0_g1 GO:0000139^cellular_component^Golgi membrane
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0003677^molecular_function^DNA binding GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
The GO terms are delimited with a tab. I want to keep the first column, with the IDs and all the columns that contain "biological_process". But how do I do that using awk, without a specific column to search in.
I basically want to use grep for columns, so was trying something with awk (but I am not experienced in awk at all):
awk '/biological_process/'
-> I get the full line
awk '{ print "biological_process" }'
-> I only get biological process
Can someone help me out? THanks!
Upvotes: 0
Views: 93
Reputation: 392
AWK:
awk -F"GO:" '{printf "%s",$1}{for(i=2;i<=NF;i++) if ($i~/biological_process/)printf FS"%s",$i ;print ""}' file
1) -F"GO:"
- use "GO:" string as separator
2) {printf "%s",$1}
- print the first column (without new line)
3) for(i=2;i<=NF;i++)
- run on all columns beside the first one
4) ($i~/biological_process/)
- check if string exists in col
5) printf FS"%s",$i
- if string exists in column print the separator and the string
6) print ""
- print new line
input file used:
TRINITY_DN173118_c0_g1 GO:0000139^cellular_component^Golgi membrane
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0003677^molecular_function^DNA binding GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
output
TRINITY_DN173118_c0_g1
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
Thanks to Ed Morton for the feedback , I have edit the Answer :).
Upvotes: 2
Reputation: 67467
another similar awk
$ awk 'BEGIN {FS=OFS="\t"}
{line=$1;
for(i=2;i<=NF;i++) if($i~/biological_process/) line=line OFS $i;
print line}' file
TRINITY_DN173118_c0_g1
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
Upvotes: 1