csijcs

Reputation: 47

How to find entries in one file that match another file

I have a file (file1) with the following contents:

ENST00000364447.1   116 16.000  0.000000    0.000000
ENST00000364424.1   107 17.000  0.000000    0.000000
ENST00000364180.1   107 17.000  0.000000    0.000000
ENST00000384451.1   107 17.000  0.000000    0.000000
ENST00000362957.1   109 17.000  0.000000    0.000000
ENST00000362478.1   107 17.000  0.000000    0.000000
ENST00000384227.1   107 17.000  0.000000    0.000000
ENST00000365615.1   107 17.000  0.000000    0.000000
ENST00000517091.1   106 17.000  0.000000    0.000000

I need to find entries in column 1 of this file that match text within column 10 of another file (file2):

chr1    HAVANA  gene    29554   31109   .   +   .   gene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1    HAVANA  transcript  29554   31097   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";

In column 10 the name is enclosed in double quotes and followed by a semicolon.

I have tried grep -F -f file1 file2 > file3, but it is incredibly slow. I've also tried a few different awk commands, but I can't seem to get the syntax right. Any help would be much appreciated.

Upvotes: 1

Views: 156

Answers (1)

NeronLeVelu

Reputation: 10039

Pure awk:

awk 'FNR==NR{gsub(/[";]/,"",$10);F[$10];next}( $1 in F )' FilterFile2 DataFile1
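For anyone unfamiliar with the FNR==NR idiom, here is a rough sketch of the same logic spread over several lines and run against two tiny made-up files. The function name filter_by_col10, the temporary file names, and the sample rows are all hypothetical and only there for illustration.

```shell
# filter_by_col10 FILTER DATA: hypothetical wrapper around the same awk logic.
filter_by_col10() {
    awk '
    # FNR == NR holds only while the first file (the filter) is being read.
    FNR == NR {
        gsub(/[";]/, "", $10)   # strip the quotes and the trailing semicolon
        seen[$10]               # referencing the element creates the key
        next                    # skip the print rule for filter lines
    }
    # Second file: print the row when column 1 is a stored key.
    $1 in seen
    ' "$1" "$2"
}

dir=$(mktemp -d)
# Made-up rows: column 10 of the filter file carries the quoted ID.
printf '%s\n' 'chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.5"; level 2;' > "$dir/filter"
printf '%s\n' 'ENSG00000243485.5 116 16.000' 'ENSG00000000001.1 107 17.000' > "$dir/data"
filter_by_col10 "$dir/filter" "$dir/data"
rm -r "$dir"
```

With these two sample files only the ENSG00000243485.5 row of the data file is printed. The filter file is read once into memory and each data row costs one hash lookup, which is why this tends to scale better than matching every pattern against every line.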

If performance matters and the filter is large (more than 10,000 entries), an alternative could be:

awk '{gsub(/[";]/,"",$10);gsub(/\./,"[.]",$10);print "^" $10 "[[:blank:]]"}' FilterFile2 > CleanFilter
grep -E -f CleanFilter DataFile1
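One thing worth checking against the sample in the question: split on whitespace, field 10 of the transcript row is the gene_id value, and the transcript_id value lands in field 12. Since file1 holds ENST transcript IDs, pulling the ID out by attribute name rather than by position may be closer to what is needed. A minimal sketch under that assumption follows; the function name match_transcripts and the sample files are hypothetical.

```shell
# match_transcripts ANNOTATION TABLE: hypothetical helper, not part of any tool.
match_transcripts() {
    awk '
    FNR == NR {
        # Find the attribute by name instead of relying on its field number.
        if (match($0, /transcript_id "[^"]+"/)) {
            # The prefix before the value is 15 characters long (name, space,
            # opening quote); the closing quote accounts for one more.
            ids[substr($0, RSTART + 15, RLENGTH - 16)]
        }
        next
    }
    $1 in ids
    ' "$1" "$2"
}

dir=$(mktemp -d)
# Shortened rows modelled on the question; rows without a transcript_id are skipped.
printf '%s\n' \
    'chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.5"; level 2;' \
    'chr1 HAVANA transcript 29554 31097 . + . gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; level 2;' \
    > "$dir/file2"
printf '%s\n' \
    'ENST00000473358.1   116 16.000  0.000000    0.000000' \
    'ENST00000364424.1   107 17.000  0.000000    0.000000' \
    > "$dir/file1"
match_transcripts "$dir/file2" "$dir/file1"
rm -r "$dir"
```

On this sample only the ENST00000473358.1 row of file1 is printed. To print the matching annotation lines instead, the roles of the two files would need to be swapped, with file1 loaded first and the extracted ID looked up while reading file2.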

Upvotes: 1
