sj820

Reputation: 1

Python: filtering through multiple files based on the contents of another file

I'm a Python newbie and have run into a problem that I can't find an answer for anywhere.

I'm trying to write code to filter a set of files based on another file. The files are tab-delimited tables with multiple rows and columns. What I would like is to remove rows from the data files whose values in certain columns match a row in the filter file.

The code is:

paths = ('filepaths.txt')#file that has filepaths to open
filter_file = ('filter.txt')#file of items to filter
filtered = open('filtered.txt','w') #output file

filtering = open(filter_file, 'r').readlines()
for f in filtering:
    filt = f.rstrip().split('\t')

files = open(paths).read().splitlines()
for file in files:
    try:
        lines = open(file,'r').readlines()
        for l in lines:
            data = l.rstrip().split('\t')

        a = [data[0], data[5], data[6], data[10], data[11]] #data columns to match
        b= [filt[0], filt[1], filt[2], filt[3], filt[4]] #filter columns to match

        for i,j in zip(a,b): #loop through two lists to filter
            if i != j:
                matches = '\t'.join(data)
                print (matches)
                filtered.write(matches + '\n')
filtered.close()

The code executes, but doesn't do what I want: what I get back is the last row of each file, repeated five times.

Clearly, I am missing something. I'm not sure if zip is the right function to use, or if something else would be better. I'd appreciate any advice.

Edit:

Sample input for filter:

HSPG2   22161380    22161380    G   A
PPTC7   110974744   110974744   G   C
OR1S2   57971546    57971546    A   C

Sample input for files to filter (extra columns left off):

TKTL1   8277    broad.mit.edu   37  X   153558089   153558089   +   3'UTR   SNP G   C   C
MPP1    4354    broad.mit.edu   37  X   154014502   154014502   +   Silent  SNP G   A   A
BRCC3   79184   broad.mit.edu   37  X   154306908   154306908   +   Silent  SNP A   T   T

Sample output (extra columns left off):

BRCC3   79184   broad.mit.edu   37  X   154306908   154306908   +   Silent  SNP A   T   T
BRCC3   79184   broad.mit.edu   37  X   154306908   154306908   +   Silent  SNP A   T   T
BRCC3   79184   broad.mit.edu   37  X   154306908   154306908   +   Silent  SNP A   T   T
BRCC3   79184   broad.mit.edu   37  X   154306908   154306908   +   Silent  SNP A   T   T
BRCC3   79184   broad.mit.edu   37  X   154306908   154306908   +   Silent  SNP A   T   T

Upvotes: 0

Views: 2384

Answers (2)

Peter DeGlopper

Reputation: 37319

I'm going to start with some simple changes, then show how you can use built-in tools like Python's csv library and the any function to simplify the code.

Here's a version that cleans things up a little and uses the correct logic, but doesn't introduce too many new language features. The main new things it uses are the with statement (which automatically closes the files when the block is exited) and iterating directly over a file rather than using readlines:

paths = 'filepaths.txt'  # file that has filepaths to open
filter_file = 'filter.txt'  # file of items to filter
with open(filter_file, 'r') as filter_source:
    filters = []
    for line in filter_source:
        filters.append(line.rstrip().split('\t'))
with open(paths, 'r') as filename_source:
    filenames = []
    for line in filename_source:
        filenames.append(line.rstrip())
with open('filtered.txt','w') as filtered:
    for filename in filenames:
        with open(filename,'r') as datafile:
            for line in datafile:
                data = line.rstrip().split('\t')
                a = [data[0], data[5], data[6], data[10], data[11]] #data columns to match
                matched = False  # also handles an empty filter list
                for filter in filters:
                    matched = True
                    for i, j in zip(a, filter):
                        if i != j:
                            matched = False
                            break
                    if matched:
                        # the data row matched a filter, stop checking the others
                        break
                if not matched:
                    filtered.write(line)

One thing we do several times there is use a for loop to build up a list. There's a more concise expression that does the same thing, called a list comprehension. Using that, we'd have:

with open(filter_file, 'r') as filter_source:
    filters = [line.rstrip().split('\t') for line in filter_source]
with open(paths, 'r') as filename_source:
    filenames = [line.rstrip() for line in filename_source]

But Python also has a useful csv library that can take care of reading the tab delimited format:

import csv

with open(filter_file, 'rb') as filter_source:
    filter_reader = csv.reader(filter_source, delimiter='\t')
    filters = list(filter_reader)

When you iterate over it, it returns each row as a list of fields, split on the delimiter character. Note that I opened the file in binary mode ('rb'); whether that makes a difference depends on your platform, but where it does, the csv docs note that it's required.
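
For example, with the question's sample filter file, each row comes back as a list of strings:

import csv

with open('filter.txt', 'rb') as filter_source:
    for row in csv.reader(filter_source, delimiter='\t'):
        print(row)
# ['HSPG2', '22161380', '22161380', 'G', 'A']
# ['PPTC7', '110974744', '110974744', 'G', 'C']
# ['OR1S2', '57971546', '57971546', 'A', 'C']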

You could use it similarly for the data files, and optionally even to write your filtered output using the writer class.
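
As a rough sketch (with a placeholder data file name rather than the real paths; the full version appears at the end of this answer):

import csv

# 'somedata.txt' stands in for one of the paths listed in filepaths.txt
with open('somedata.txt', 'rb') as datafile, open('filtered.txt', 'wb') as outfile:
    data_reader = csv.reader(datafile, delimiter='\t')
    filtered_writer = csv.writer(outfile, delimiter='\t')
    for data in data_reader:
        # the filtering test goes here; writerow emits one tab-delimited row
        filtered_writer.writerow(data)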

Finally, the any and all builtins take iterables and return True if any (or all) of their contents evaluate to True. You can use them to drop the nested for loop, using a generator expression - a construct similar to a list comprehension, except that it's lazily evaluated, which is nice because any and all will short-circuit. So here's a way to write this:

def match(dataline, filter):
    return all(i==j for (i, j) in zip(dataline, filter))

In this particular case I'm not getting much out of the short-circuiting, because I'm using zip to build an actual list of tuples. But for such short lists it's fine, and zip outperforms itertools.izip (the lazily evaluated version) on lists that are already in memory.
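
For comparison, the lazy version would only differ in the import (a sketch; itertools.izip is Python 2's lazy equivalent of zip):

from itertools import izip  # Python 2's lazy zip

def match(dataline, filter):
    # izip yields pairs one at a time, so all() can stop at the first mismatch
    return all(i == j for (i, j) in izip(dataline, filter))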

Then you can use any to concisely compare the row to all your filters, shortcircuiting as soon as one matches:

a = [data[0], data[5], data[6], data[10], data[11]]
if not any(match(a, filter) for filter in filters):
    filtered.write(line)

Except that this is still overkill. The match function checks that all elements of its two inputs are equal, but that's exactly what Python does automatically when you test whether two lists are equal. (One difference: because zip stops at the shorter input, match as I wrote it lets lists of unequal length match as long as the leading elements of the longer list match the shorter one, while Python list equality does not - but that's not an issue here.) So this would also work:

a = [data[0], data[5], data[6], data[10], data[11]]
if not any(a == filter for filter in filters):
    filtered.write(line)

Or, if longer than normal filters are something you might want to tolerate:

if not any(a == filter[:5] for filter in filters):

The non-slicing version can also be written with direct list membership testing:

if a not in filters:
    filtered.write(line)

Also, as Blckknght points out, Python has a better way to quickly test whether something like a line matches any of a number of patterns: the set datatype, which has constant-time lookups. Lists, like those returned by the csv library or by split, can't be members of a set - but tuples can, as long as the members of the tuples are themselves hashable. So if you convert your filters and your data line subsets into tuples, you can maintain a set instead of a list and check it even faster. To do that, convert each filter to a tuple:

filters = set(tuple(filter) for filter in filter_reader)

Then, define a as a tuple:

a = (data[0], data[5], data[6], data[10], data[11])
if a not in filters:
    filtered.write(line)

If you're using a csv.writer instance to write the output, you could even consolidate it further using the writerows method and a generator expression:

filtered_writer.writerows(data for data in data_reader if (data[0], data[5], data[6], data[10], data[11]) not in filters)

So wrapping it all up, I would do this like this:

import csv

paths = 'filepaths.txt'  # file that has filepaths to open
filter_file = 'filter.txt'  # file of items to filter
with open(filter_file, 'rb') as filter_source:
    filters = set(tuple(filter) for filter in csv.reader(filter_source, delimiter='\t'))
with open(paths, 'r') as filename_source:
    filenames = [line.rstrip() for line in filename_source]
with open('filtered.txt','wb') as filtered:
    filtered_writer = csv.writer(filtered, delimiter='\t')
    for filename in filenames:
        with open(filename,'rb') as datafile:
            data_reader = csv.reader(datafile, delimiter='\t')
            filtered_writer.writerows(data for data in data_reader if (data[0], data[5], data[6], data[10], data[11]) not in filters)

Upvotes: 2

hunse

Reputation: 3255

When you create filt, you're creating a single variable and overwriting it on each pass through the loop, so when the loop finishes it holds only the last row of the filter file. Try replacing

for f in filtering:
    filt = f.rstrip().split('\t')

with

filt = [f.rstrip().split('\t') for f in filtering]

Now filt is a list of lists, with each element representing one row. So for example, filt[0] will give you the first row, and filt[2][3] will give you the fourth column of the third row. You may have to modify the rest of your program to work correctly with this.
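
For example (a sketch only, keeping your column choices and assuming each filter row has exactly the five columns shown in the sample), the rest of the program could become:

filt = [f.rstrip().split('\t') for f in filtering]

files = open(paths).read().splitlines()
for file in files:
    lines = open(file, 'r').readlines()
    for l in lines:
        data = l.rstrip().split('\t')
        a = [data[0], data[5], data[6], data[10], data[11]]
        # keep the row only if it matches none of the filter rows
        if all(a != row for row in filt):
            filtered.write(l)
filtered.close()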

Upvotes: 0
