Reputation: 47
I have a file (file1) with the following contents:
ENST00000364447.1 116 16.000 0.000000 0.000000
ENST00000364424.1 107 17.000 0.000000 0.000000
ENST00000364180.1 107 17.000 0.000000 0.000000
ENST00000384451.1 107 17.000 0.000000 0.000000
ENST00000362957.1 109 17.000 0.000000 0.000000
ENST00000362478.1 107 17.000 0.000000 0.000000
ENST00000384227.1 107 17.000 0.000000 0.000000
ENST00000365615.1 107 17.000 0.000000 0.000000
ENST00000517091.1 106 17.000 0.000000 0.000000
I need to find entries in column 1 of this file that match text within column 10 of another file (file2):
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1 HAVANA transcript 29554 31097 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
In column 10 the name is enclosed in double quotes.
I have tried grep -F -f file1 file2 > file3, but it is incredibly slow. I've also tried a few different awk commands, but I can't seem to get the syntax right. Any help would be much appreciated.
Upvotes: 1
Views: 156
Reputation: 10039
Pure awk:
awk 'FNR==NR{gsub(/[";]/,"",$10);F[$10];next}( $1 in F )' FilterFile2 DataFile1
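The same logic spelled out with comments (a sketch: FilterFile2 stands for the question's file2 and DataFile1 for file1, so adjust the names to your files):
awk '
    FNR==NR {                  # while reading the first file (file2): build the lookup table
        gsub(/[";]/, "", $10)  # strip the quotes and semicolon from column 10
        F[$10]                 # remember the cleaned ID as an array key
        next                   # skip the matching rule below for this file
    }
    ($1 in F)                  # second file (file1): print lines whose column 1 was stored
' FilterFile2 DataFile1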
If the filter file is huge (more than ~10,000 entries) and performance is a concern, an alternative could be
awk '{gsub(/[";]/,"",$10);print "^" $10 "[[:blank:]]"}' FilterFile2 > CleanFilter
grep -E -f CleanFilter DataFile1
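CleanFilter then holds one anchored pattern per line (for the sample file2 lines above, each would be ^ENSG00000243485.5[[:blank:]], taken from column 10), and the result can be redirected to file3 just as in the question, e.g.:
grep -E -f CleanFilter DataFile1 > file3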
Upvotes: 1