user964689

Reputation: 822

filter one text file based on another

It's Friday afternoon and I am struggling to filter one file based on the contents of another. I have one file with a list of tab-separated values, e.g.

1   H   3   0.3937180424
1   H   4   0.3594894329
1   H   5   0.3501040944
1   H   6   0.2699868938
1   H   7   0.3200876953
1   H   8   0.3047540533
1   H   9   0.3088543852
1   H   10  0.305982215
1   H   11  0.2798568174

and another file with tab-separated values, e.g.

chr1    1   74440
chr1    2   90281
chr1    3   136529
chr1    4   484700
chr1    5   294898
chr1    6   284812
chr1    7   432322
chr1    8   458256
chr1    9   290078
chr1    10  366518
chr1    11  342903

I want to filter the second file so that it only includes positions that appear in the first file. Currently the second file has surplus lines, and some need to be removed. The position information comes from the first and third columns of the first file combined. So the position in line one of the example is 1 3, meaning chromosome 1, position 3. This corresponds to chr1 3 in the second file (third line). Does anyone know a simple way to filter file 2 by file 1? I could remove the 'chr' string in file 2 if that makes it simpler. Any quick solution that I can use in the shell or in Python (I am learning that language) would be really great. I need to solve this so that I can use the output in an analysis.

Thanks in advance for your help,

Rubal

Upvotes: 0

Views: 2830

Answers (3)

glenn jackman

Reputation: 247210

Just with awk:

awk -F '\t' '
  FILENAME == ARGV[1] { pair["chr" $1 FS $3] = 1; next }
  ($1 FS $2) in pair
' file1 file2

Upvotes: 1

Fredrik Pihl

Reputation: 45670

You asked for Python:

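#!/usr/bin/env python3

# Collect (chromosome, position) pairs from file 1 as tuples, so that
# chr1 + 11 and chr11 + 1 cannot collapse into the same key.
wanted = set()

with open("f1") as fd:
    for line in fd:
        fields = line.split()
        if len(fields) >= 3:
            wanted.add(("chr" + fields[0], fields[2]))

# Print only the lines of file 2 whose first two columns are in the set.
with open("f2") as fd:
    for line in fd:
        fields = line.split()
        if len(fields) >= 2 and (fields[0], fields[1]) in wanted:
            print(line.strip())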

output:

chr1    3   136529
chr1    4   484700
chr1    5   294898
chr1    6   284812
chr1    7   432322
chr1    8   458256
chr1    9   290078
chr1    10  366518
chr1    11  342903

Upvotes: 1

chepner

Reputation: 532333

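Assuming you use bash as your shell (the command relies on process substitution), this may work. Each pattern is anchored with ^ and ends in a tab, so that chr1 1 does not also match chr1 10. I'm not sure how well it will perform if file1.txt is large.

grep -f <( awk '{print "^chr"$1"\t"$3"\t"}' file1.txt ) file2.txt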
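One thing to check with any pattern-based filter: an unanchored pattern such as chr1<TAB>1 also matches lines for chr1 10 and chr1 11. The following is a minimal Python sketch of the difference, not a definitive implementation; the function name filter_lines and the in-memory sample lines are made up for illustration, with the sample values taken from the question.

```python
import re


def filter_lines(keys, lines):
    """Keep lines whose first two tab-separated fields form a key in `keys`."""
    wanted = set(keys)
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and (fields[0], fields[1]) in wanted:
            kept.append(line.rstrip("\n"))
    return kept


# Three lines in the layout of the second file (hypothetical in-memory sample).
file2_lines = ["chr1\t1\t74440", "chr1\t10\t366518", "chr1\t11\t342903"]

# Unanchored pattern: behaves like a substring search, so it matches all three.
substring_hits = [l for l in file2_lines if re.search("chr1\t1", l)]

# Exact comparison on (chromosome, position) keeps only the intended line.
exact_hits = filter_lines([("chr1", "1")], file2_lines)

print(len(substring_hits))
print(exact_hits)
```

Comparing whole fields avoids the over-matching, and a set lookup stays fast even when the first file is large.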

Upvotes: 1
