Reputation: 398
I have a data table in TSV format; the content of the file looks like this:
Gene_stable_ID Gene_stable_ID_version Transcript_stable_ID Transcript_stable_ID_version Gene_name Gene_type
ENSMUSG00000064372 ENSMUSG00000064372.1 ENSMUST00000082423 ENSMUST00000082423.1 Cyp Mt_tRNA
ENSMUSG00000064371 ENSMUSG00000064371.1 ENSMUST00000082422 ENSMUST00000082422.1 mt-Tt unprocessed_pseudogene
ENSMUSG00000064370 ENSMUSG00000064370.1 ENSMUST00000082421 ENSMUST00000082421.1 Cyp processed_pseudogene
ENSMUSG00000064369 ENSMUSG00000064369.1 ENSMUST00000082420 ENSMUST00000082420.1 Cyp pseudogene
My goal is to get the rows where Gene_name is 'Cyp' and Gene_type is 'protein_coding', 'pseudogene', or 'processed_pseudogene'.
I tried to do this with grep and awk:
grep -i Cyp mapping.tsv | awk -F "\t" '{ if($NF == "protein_coding" || $NF == "pseudogene" || $NF == "processed_pseudogene") { print }}'
With this command I only get the rows where Gene_name is Cyp and Gene_type is protein_coding; the pseudogene rows are ignored.
Can you help me figure this out? Thanks.
Upvotes: 0
Views: 70
Reputation: 133458
Could you please try the following?
awk '$(NF-1)~/Cyp[0-9]+/ && ($NF=="protein_coding" || $NF=="pseudogene" || $NF=="processed_pseudogene"){print $(NF-1),$NF}' Input_file
Or, as a multi-line form of the same command:
awk '
$(NF-1)~/Cyp[0-9]+/ && ($NF=="protein_coding" || $NF=="pseudogene" || $NF=="processed_pseudogene"){
print $(NF-1),$NF
}' Input_file
This assumes that the Gene_name and Gene_type values don't contain spaces. Also, to print the complete line, remove the {print $(NF-1),$NF} part from the code above.
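To illustrate the whole-line variant, here is a minimal sketch. It assumes the file is tab-separated (hence -F'\t', which the commands above do not use) and that the gene name is exactly Cyp, as in the sample rows, so it uses == instead of the Cyp[0-9]+ regex. The temporary file and the tmp and result variables are only there for the demonstration.

```shell
# Build a small tab-separated sample file from the rows shown in the question.
# (tmp is a throwaway file used only for this demonstration.)
tmp=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  Gene_stable_ID Gene_stable_ID_version Transcript_stable_ID Transcript_stable_ID_version Gene_name Gene_type \
  ENSMUSG00000064372 ENSMUSG00000064372.1 ENSMUST00000082423 ENSMUST00000082423.1 Cyp Mt_tRNA \
  ENSMUSG00000064371 ENSMUSG00000064371.1 ENSMUST00000082422 ENSMUST00000082422.1 mt-Tt unprocessed_pseudogene \
  ENSMUSG00000064370 ENSMUSG00000064370.1 ENSMUST00000082421 ENSMUST00000082421.1 Cyp processed_pseudogene \
  ENSMUSG00000064369 ENSMUSG00000064369.1 ENSMUST00000082420 ENSMUST00000082420.1 Cyp pseudogene \
  > "$tmp"

# A pattern with no action block prints the whole matching line.
# Assumption: fields are split on tabs, and Gene_name must equal "Cyp" exactly.
result=$(awk -F'\t' '
  $(NF-1) == "Cyp" && ($NF == "protein_coding" || $NF == "pseudogene" || $NF == "processed_pseudogene")
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

On this sample only the two Cyp rows with Gene_type processed_pseudogene and pseudogene should be printed; the Cyp row with Mt_tRNA and the mt-Tt row are skipped.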
EDIT: In case you want to use a regex to check the condition, use the following (again, the regex needs to be adjusted to your actual data):
awk '
$(NF-1)~/Cyp[0-9]+/ && ($NF=="protein_coding" || $NF~/pseudogene/){
print $(NF-1),$NF
}' Input_file
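One thing to keep in mind with the regex form: an unanchored pattern such as /pseudogene/ also matches unprocessed_pseudogene, which was not in the requested list. The sketch below compares an unanchored and an anchored pattern on a hypothetical two-column input (Gene_name and Gene_type, tab-separated); the data, loose, and strict variables are made up for illustration.

```shell
# Hypothetical tab-separated input: Gene_name <TAB> Gene_type.
data=$(printf '%s\t%s\n' \
  Cyp pseudogene \
  Cyp processed_pseudogene \
  Cyp unprocessed_pseudogene \
  Cyp protein_coding \
  Cyp Mt_tRNA)

# Unanchored: matches any Gene_type that contains "pseudogene".
loose=$(printf '%s\n' "$data" | awk -F'\t' '$NF ~ /pseudogene/ {print $NF}')

# Anchored: matches only the three requested types, nothing longer.
strict=$(printf '%s\n' "$data" |
  awk -F'\t' '$NF ~ /^(protein_coding|(processed_)?pseudogene)$/ {print $NF}')

echo "unanchored:"; printf '%s\n' "$loose"
echo "anchored:";   printf '%s\n' "$strict"
```

The unanchored run includes unprocessed_pseudogene, while the anchored one keeps only pseudogene, processed_pseudogene, and protein_coding.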
Upvotes: 3
Reputation: 765
Not exactly an answer, but given your conditions I think grep alone is sufficient.
Try:
grep "Cyp" mapping.tsv | grep -E "protein_coding|pseudogene|processed_pseudogene"
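Note that plain grep matches substrings anywhere on the line, so the pipeline above also keeps rows such as unprocessed_pseudogene. A possible tightening, assuming a grep that supports -w (GNU and BSD grep do), is whole-word matching; since "_" counts as a word character, pseudogene no longer matches inside unprocessed_pseudogene. The rows below are hypothetical, and rows, loose, and strict are illustration-only names.

```shell
# Hypothetical tab-separated rows: id <TAB> Gene_name <TAB> Gene_type.
rows=$(printf '%s\t%s\t%s\n' \
  G1 Cyp pseudogene \
  G2 Cyp unprocessed_pseudogene \
  G3 Cyp processed_pseudogene \
  G4 mt-Tt pseudogene)

# Substring matching: G2 slips through because it contains "pseudogene".
loose=$(printf '%s\n' "$rows" | grep "Cyp" |
  grep -E "protein_coding|pseudogene|processed_pseudogene")

# Whole-word matching (-w): only complete words count as matches.
strict=$(printf '%s\n' "$rows" | grep -w "Cyp" |
  grep -wE "protein_coding|pseudogene|processed_pseudogene")

echo "substring:";  printf '%s\n' "$loose"
echo "whole word:"; printf '%s\n' "$strict"
```

This still does not check which column the word appears in, so the awk approach remains the safer choice when the same text could occur in other fields.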
Upvotes: 1