Using awk on all columns for just part of column content

Question

I am trying to find a solution for the following. I have a list of gene IDs in my first column, and the related GO terms in all the other columns. The number of columns after each gene ID is therefore variable. The first few lines look like this:

TRINITY_DN173118_c0_g1  GO:0000139^cellular_component^Golgi membrane
TRINITY_DN49436_c2_g1   GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1   GO:0003677^molecular_function^DNA binding   GO:0006302^biological_process^double-strand break repair    GO:0006310^biological_process^DNA recombination

The GO terms are delimited with a tab. I want to keep the first column, with the IDs, and all the columns that contain "biological_process". But how do I do that using awk when there is no specific column to search in?

I basically want to use grep for columns, so was trying something with awk (but I am not experienced in awk at all):

awk '/biological_process/' -> I get the full line
awk '{ print "biological_process" }' -> I only get the text "biological_process"

Can someone help me out? Thanks!

shaiki siegal · Accepted Answer

AWK:

awk -F"GO:" '{printf "%s",$1}{for(i=2;i<=NF;i++) if ($i~/biological_process/)printf FS"%s",$i ;print ""}' file

1) -F"GO:" - use the string "GO:" as the field separator

2) {printf "%s",$1} - print the first column (without a newline)

3) for(i=2;i<=NF;i++) - loop over all columns except the first one

4) ($i~/biological_process/) - check whether the column contains the string

5) printf FS"%s",$i - if the column contains the string, print the separator followed by the column

6) print "" - print a newline
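To see why step 5 has to print FS again, it may help to look at how awk splits a line when "GO:" is the separator. The following is only a small sketch: show_fields is a hypothetical helper name and the sample line is shortened for illustration. It prints each field in brackets, which shows that the separator itself is dropped from the fields while the tab in front of the next "GO:" stays attached to the end of the previous field.

```shell
# show_fields is a hypothetical helper, not part of the answer above.
# It prints every field awk sees when "GO:" is the field separator.
show_fields() {
    awk -F"GO:" '{ for (i = 1; i <= NF; i++) printf "field %d: [%s]\n", i, $i }' "$@"
}

# Shortened sample line, with tabs between the columns (an assumption
# taken from the question's description of the data).
printf 'ID1\tGO:0003677^molecular_function^DNA binding\tGO:0006302^biological_process^repair\n' | show_fields
```

Field 1 is the gene ID plus its trailing tab, and every later field is one GO term without its "GO:" prefix, which is why the command above puts the prefix back with FS.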

input file used:

  TRINITY_DN173118_c0_g1  GO:0000139^cellular_component^Golgi membrane
  TRINITY_DN49436_c2_g1   GO:0006351^biological_process^transcription, DNA-templated 
  TRINITY_DN47442_c0_g1   GO:0003677^molecular_function^DNA binding GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination

output

   TRINITY_DN173118_c0_g1  
   TRINITY_DN49436_c2_g1   GO:0006351^biological_process^transcription, DNA-templated
   TRINITY_DN47442_c0_g1   GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
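Since the question says the columns are tab-delimited, the same filter could also be sketched with a tab as the field separator instead of "GO:". This is only a sketch under that assumption: it will not split the columns correctly if the file actually uses spaces between GO terms, and keep_bp is a hypothetical helper name used here for illustration.

```shell
# keep_bp is a hypothetical helper name. It reads tab-separated lines
# from the files given as arguments (or from stdin) and keeps column 1
# plus every column that contains "biological_process".
keep_bp() {
    awk -F'\t' -v OFS='\t' '{
        out = $1                              # always keep the gene ID
        for (i = 2; i <= NF; i++)             # scan the remaining columns
            if ($i ~ /biological_process/)    # keep only matching GO terms
                out = out OFS $i
        print out
    }' "$@"
}

# Example with one line of the sample data, supplied on stdin:
printf 'TRINITY_DN47442_c0_g1\tGO:0003677^molecular_function^DNA binding\tGO:0006302^biological_process^double-strand break repair\n' | keep_bp
```

For a real file this would be called as keep_bp file. Unlike the "GO:" version, lines without any matching term are printed as the bare gene ID, with no trailing whitespace.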

Thanks to Ed Morton for the feedback; I have edited the answer :).
