User

Reputation: 398

AWK - print output

I have a data table in TSV format. The content of the file looks like this:

Gene_stable_ID      Gene_stable_ID_version  Transcript_stable_ID  Transcript_stable_ID_version  Gene_name  Gene_type
ENSMUSG00000064372  ENSMUSG00000064372.1    ENSMUST00000082423    ENSMUST00000082423.1          Cyp        Mt_tRNA
ENSMUSG00000064371  ENSMUSG00000064371.1    ENSMUST00000082422    ENSMUST00000082422.1          mt-Tt      unprocessed_pseudogene
ENSMUSG00000064370  ENSMUSG00000064370.1    ENSMUST00000082421    ENSMUST00000082421.1          Cyp        processed_pseudogene
ENSMUSG00000064369  ENSMUSG00000064369.1    ENSMUST00000082420    ENSMUST00000082420.1          Cyp        pseudogene

My goal is to get the rows where the 'Gene name' is 'Cyp' and the 'Gene type' is 'protein_coding', 'pseudogene', or 'processed_pseudogene'.

I tried to do this with grep and awk, like this:

grep -i Cyp mapping.tsv | awk -F "\t" '{ if($NF == "protein_coding" || $NF == "pseudogene" || $NF == "processed_pseudogene") { print }}'

With this, I only get the rows where the 'Gene name' is Cyp and the 'Gene type' is protein_coding; the 'pseudogene' rows are ignored.

Can you help me figure this out? Thanks.

Upvotes: 0

Views: 70

Answers (2)

RavinderSingh13

Reputation: 133458

Could you please try the following:

awk '$(NF-1)~/Cyp[0-9]+/ && ($NF=="protein_coding" || $NF=="pseudogene" || $NF=="processed_pseudogene"){print $(NF-1),$NF}' Input_file

Or, as a non-one-liner form of the above:

awk '
$(NF-1)~/Cyp[0-9]+/ && ($NF=="protein_coding" || $NF=="pseudogene" || $NF=="processed_pseudogene"){
  print $(NF-1),$NF
}'  Input_file

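This assumes that the values in your Gene_name and Gene_type fields don't contain spaces, since awk splits on whitespace by default. Also, to print the complete line, remove the {print $(NF-1),$NF} part from the code above; awk then prints every matching line in full.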
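For what it's worth, here is a minimal sketch of the "print the complete line" variant run against the rows from the question. It rests on a few assumptions that are mine and not from the question: the file really is tab-separated (hence -F '\t'), Gene_name and Gene_type are columns 5 and 6, and the gene name should match Cyp exactly (the sample rows show plain Cyp, which /Cyp[0-9]+/ would not match because it requires at least one digit). The temporary file and the variable names are just for illustration.

```shell
# Recreate the sample rows from the question as a real tab-separated file.
# printf reuses the format string for every group of six arguments.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  Gene_stable_ID Gene_stable_ID_version Transcript_stable_ID Transcript_stable_ID_version Gene_name Gene_type \
  ENSMUSG00000064372 ENSMUSG00000064372.1 ENSMUST00000082423 ENSMUST00000082423.1 Cyp Mt_tRNA \
  ENSMUSG00000064371 ENSMUSG00000064371.1 ENSMUST00000082422 ENSMUST00000082422.1 mt-Tt unprocessed_pseudogene \
  ENSMUSG00000064370 ENSMUSG00000064370.1 ENSMUST00000082421 ENSMUST00000082421.1 Cyp processed_pseudogene \
  ENSMUSG00000064369 ENSMUSG00000064369.1 ENSMUST00000082420 ENSMUST00000082420.1 Cyp pseudogene \
  > "$sample"

# A pattern with no action block prints the whole matching line.
# Assumption: column 5 is Gene_name and column 6 is Gene_type.
whole_lines=$(awk -F '\t' '
  $5 == "Cyp" && ($6 == "protein_coding" || $6 == "pseudogene" || $6 == "processed_pseudogene")
' "$sample")

printf '%s\n' "$whole_lines"
rm -f "$sample"
```

On the sample rows this should keep only the two Cyp rows whose type is processed_pseudogene or pseudogene. If your real gene names look like Cyp followed by digits, swap the exact comparison back to a regex match.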

EDIT: In case you want to use a regex for the check, use the following (again, the regex needs to be adjusted to fit your samples):

awk '
$(NF-1)~/Cyp[0-9]+/ && ($NF=="protein_coding" || $NF~/.*pseudogene/ || $NF=="processed_pseudogene"){
  print $(NF-1),$NF
}'  Input_file
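One thing to watch with the regex form: an unanchored match on pseudogene also accepts unprocessed_pseudogene, which is not in the list of wanted types. Below is a small sketch comparing a loose and an anchored regex. The two-column rows are made up purely for illustration (they are not from your file), and the exact Cyp comparison on the first column is my assumption.

```shell
# Hypothetical two-column rows (gene name, gene type), tab-separated.
rows=$(printf '%s\t%s\n' \
  Cyp pseudogene \
  Cyp processed_pseudogene \
  Cyp unprocessed_pseudogene \
  Cyp protein_coding \
  Cyp Mt_tRNA)

# Loose: any type that contains "pseudogene" anywhere in the field.
loose=$(printf '%s\n' "$rows" |
  awk -F '\t' '$1 == "Cyp" && $NF ~ /pseudogene/ {print $NF}')

# Anchored: the type must be exactly one of the three wanted values.
strict=$(printf '%s\n' "$rows" |
  awk -F '\t' '$1 == "Cyp" && $NF ~ /^(protein_coding|(processed_)?pseudogene)$/ {print $NF}')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The loose version lets unprocessed_pseudogene through and misses protein_coding; the anchored one keeps exactly the three requested types.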

Upvotes: 3

Vikas Mulaje

Reputation: 765

Not exactly an answer to the awk question, but given your conditions I think grep alone is sufficient.

Try:

grep "Cyp" mapping.tsv | grep -E "protein_coding|pseudogene|processed_pseudogene"
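Note that plain grep matches substrings anywhere in the line, so pseudogene also hits unprocessed_pseudogene. A possible tightening, sketched here on made-up rows, is the -w (whole word) option, which is available in GNU and BSD grep but is not part of POSIX, so treat it as an assumption about your system. Underscores count as word characters, so unprocessed_pseudogene no longer matches.

```shell
# Hypothetical two-column rows (gene name, gene type), tab-separated.
rows=$(printf '%s\t%s\n' \
  Cyp pseudogene \
  Cyp unprocessed_pseudogene \
  Cyp processed_pseudogene \
  mt-Tt pseudogene)

# Substring matching: also keeps the unprocessed_pseudogene row.
plain=$(printf '%s\n' "$rows" | grep "Cyp" |
  grep -E "protein_coding|pseudogene|processed_pseudogene")

# Whole-word matching: each alternative must stand as a complete word.
whole=$(printf '%s\n' "$rows" | grep -w "Cyp" |
  grep -wE "protein_coding|pseudogene|processed_pseudogene")

printf 'plain:\n%s\n\nwhole-word:\n%s\n' "$plain" "$whole"
```

Even with -w, grep still looks at the whole line and not at a specific column, so the awk approach is safer if Cyp or a type name could appear in another field.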

Upvotes: 1
