Reputation: 822
It's Friday afternoon and I am struggling to filter one file based on the contents of another. I have one file with a list of tab-separated values, e.g.
1 H 3 0.3937180424
1 H 4 0.3594894329
1 H 5 0.3501040944
1 H 6 0.2699868938
1 H 7 0.3200876953
1 H 8 0.3047540533
1 H 9 0.3088543852
1 H 10 0.305982215
1 H 11 0.2798568174
and another file with tab-separated values, e.g.
chr1 1 74440
chr1 2 90281
chr1 3 136529
chr1 4 484700
chr1 5 294898
chr1 6 284812
chr1 7 432322
chr1 8 458256
chr1 9 290078
chr1 10 366518
chr1 11 342903
I want to filter the second file so that it only includes positions that appear in the first file. The second file currently has surplus lines that need to be removed. The position comes from the first and third columns of the first file combined, so the position in line one of the example is 1 3, meaning chromosome 1, position 3. This corresponds to chr1 3 in the second file (third line). Does anyone know a simple way to filter file 2 by file 1? I could remove the 'chr' string from file 2 if that makes it simpler. Any quick solution that I can use in the shell or in Python (I am learning that language) would be really great. I need to solve this so that I can use the output in an analysis.
Thanks in advance for your help,
Rubal
Upvotes: 0
Views: 2830
Reputation: 247210
Just with awk:
awk -F '\t' '
FILENAME == ARGV[1] { pair["chr" $1 FS $3] = 1; next }
($1 FS $2) in pair
' file1 file2
Upvotes: 1
Reputation: 45670
You asked for Python:
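#!/usr/bin/env python3
# Collect (chromosome, position) keys from the first file.
# Tuple keys keep the two fields separate.
wanted = set()
with open("f1") as fd:
    for line in fd:
        fields = line.split()
        if len(fields) >= 3:
            wanted.add(("chr" + fields[0], fields[2]))
# Print only the lines of the second file whose key was seen.
with open("f2") as fd:
    for line in fd:
        fields = line.split()
        if len(fields) >= 2 and (fields[0], fields[1]) in wanted:
            print(line.rstrip("\n"))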
output:
chr1 3 136529
chr1 4 484700
chr1 5 294898
chr1 6 284812
chr1 7 432322
chr1 8 458256
chr1 9 290078
chr1 10 366518
chr1 11 342903
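A note on how the lookup key is built, as a small sketch rather than anything definitive. The sample data only contains chromosome 1, so this assumes the real files cover several chromosomes. If the chromosome and position are glued together with no separator, two different positions can produce the same key. The helper names joined_key and tuple_key are made up for this illustration.

```python
def joined_key(chrom, pos):
    # Key style that glues the two fields together with no separator.
    return "%s%s" % (chrom, pos)


def tuple_key(chrom, pos):
    # Key style that keeps the two fields separate.
    return (chrom, pos)


a = ("chr1", "11")   # chromosome 1, position 11
b = ("chr11", "1")   # chromosome 11, position 1

# Both joined keys become the string "chr111", so the positions collide.
print(joined_key(*a) == joined_key(*b))

# The tuple keys stay distinct.
print(tuple_key(*a) == tuple_key(*b))
```

Using a tuple (or a joined string with a tab between the fields) as the key sidesteps this.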
Upvotes: 1
Reputation: 532333
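Assuming you use bash as your shell, this may work. I'm not sure how performance will be if file1.txt is large. The -F and -w options make grep treat each pattern as a fixed string that must match whole words.
grep -wFf <( awk '{print "chr"$1"\t"$3}' file1.txt ) file2.txt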
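One thing worth checking with pattern-based filtering like this: if the patterns are matched as plain substrings, a pattern generated for position 3 also occurs inside a line for position 30. The following Python sketch only illustrates that difference; the row for position 30 is made up and does not come from the files in the question.

```python
line = "chr1\t30\t999"   # made-up row: chromosome 1, position 30
pattern = "chr1\t3"      # pattern generated from a file1 row for position 3

# Substring test, similar in spirit to an unanchored pattern match:
# the pattern is found inside the line even though the position differs.
substring_hit = pattern in line

# Comparing whole fields instead only matches identical positions.
chrom, pos = line.split("\t")[:2]
field_hit = (chrom, pos) == tuple(pattern.split("\t"))

print(substring_hit, field_hit)
```

Matching whole words or whole fields avoids keeping such extra lines.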
Upvotes: 1