user964689
user964689

Reputation: 822

Filter a file based on values in columns of another file

I have a file with a list of positions (columns 1 + 2) and values associated with those positions:

File1.txt:

1 20   A   G
4 400  T   C
1 12   A   T
2 500  G   C

And another file with some of the same positions. There may be multiple rows with the same positions as in File1.txt

File2.txt

#CHR    POS       Count_A Count_C Count_G Count_T
1       20        0       18      2       0
4       400       0       0       0       1
1       12        0       7       0       40
4       400       0       1       0       1
5       50        16      0       0       0
2       500       9       0       4       0

I need to output a version of File1.txt excluding any rows that ever meet both these two conditions:

1: If the positions (columns 1+2) match in File1.txt and File2.txt 2: If the count is > 0 in the column in File2.txt that matches the letter(A,G,C,T) in column 4 of File1.txt for that position.

So for the example above the first row of File1.txt would not be output because in file2.txt for the matching row (based on the first 2 columns: 1 20), the 4th column has the letter G and for this row in File2.txt the Count_G column is >0.

The only line that would be output for this example would be:

2 500 G C

To me the particularly tricky part is that there can be multiple matching rows in file2.txt and I want to exclude rows in File1.txt if the appropriate column in File2.txt is >0 in even just one row in File2.txt. Meaning that in the example above line 2 of File1.txt would not be included because Count_C is > 0 the second time that position appears in File2.txt (Count_C = 1).

I am not sure if that kind of filtering is possible in a single step. Would it be easier to output a list of rows in File1.txt where the count in File2.txt for the letter in the 4th column in File1.txt is >0. Then use this list to compare to File1.txt and remove any rows that appear in both files?

I've filtered one file based on values in another before with the code below, but this was for when there was only one column of values to filter for in file2.txt. I am not sure how to do the conditional filtering so that I check the right column based on the letter in column 4 of file1.txt

My current code is in python but any solution is welcome:

f2 = open('file2.txt', 'r')
d2 = {}
for line in f2.split('\n'):
    line = line.rstrip()
    fields = line.split("\t")
    key = (fields[0], fields[1])
    d2[key] = int(fields[2])

f1 = open('file1.txt', 'r')
for line in file1.split('\n'):
    line = line.rstrip()
    fields = line.split("\t")
    key = (fields[0], fields[1])
    if d2[key] > 1000:
        print line

I think my previous solution is already very verbose and feel there might be a simple tool for this kind of problem of which I am not aware.

Upvotes: 0

Views: 1302

Answers (2)

Attersson
Attersson

Reputation: 4876

Your code seems pretty good to me. You can perhaps edit

d2[key] = int(fields[2])

and

if d2[key] > 1000:
        print line

As they puzzle me a little bit.

I would do it like this:

f2 = open('file2.txt', 'r')
d2 = {}
for line in f2.split('\n'):
    fields = line.rstrip().split("\t")
    key = (fields[0], fields[1])
    d2[key] = {'A':int(fields[2]),'C':int(fields[3]),'G':int(fields[4]),
      'T':int(fields[5])}

f1 = open('file1.txt', 'r')
for line in f1.split('\n'):
    line = line.rstrip()
    fields = line.split("\t")
    key = (fields[0], fields[1])
    if (key not in d2) or (d2[key][str(fields[2])] == 0 and d2[key][str(fields[3])] == 0):
        print(line)

Edit: If you have an arbitrary number of letters (and columns in file 2) just generalize the dictionary inside d2 which I have hard coded. Easy. LEt's add 2 letters:

col_names = ['A','C','G','T','K','L']

for a,i in zip(fields[2:],range(len(fields[2:]))):
    d2[key][col_names.index(i)] = a

Upvotes: 1

choroba
choroba

Reputation: 242423

I used Perl to solve the problem. First, it loads File2 into a hash table keyed by the char, pos, and nucleotide, the value is the number for the nucleotide. Then, the second file is processed. If there's non-zero value in the hash table for its char, pos, and nucleotide, it's not printed.

#!/usr/bin/perl
use warnings;
use strict;

my %gt0;
open my $in2, '<', 'File2.txt' or die $!;
<$in2>;  # Skip the header.
while (<$in2>) {
    my %count;
    (my ($chr, $pos), @count{qw{ A C G T }}) = split;
    $gt0{$chr}{$pos}{$_} = $count{$_} for qw( A C G T );
}

open my $in1, '<', 'File1.txt' or die $!;
while (<$in1>) {
    my ($chr, $pos, undef, $c4) = split;
    print unless $gt0{$chr}{$pos}{$c4};
}

Upvotes: 1

Related Questions