Reputation: 1
I'm a Python newbie and have run into a problem that I can't find an answer for anywhere.
I'm trying to write code to filter a set of files based on another file. The files are arrays with multiple rows and columns. What I would like is to remove rows from the data files that match rows in the filter file for certain columns.
The code is:
paths = ('filepaths.txt') #file that has filepaths to open
filter_file = ('filter.txt') #file of items to filter
filtered = open('filtered.txt','w') #output file
filtering = open(filter_file, 'r').readlines()
for f in filtering:
    filt = f.rstrip().split('\t')
    files = open(paths).read().splitlines()
    for file in files:
        try:
            lines = open(file,'r').readlines()
            for l in lines:
                data = l.rstrip().split('\t')
                a = [data[0], data[5], data[6], data[10], data[11]] #data columns to match
                b = [filt[0], filt[1], filt[2], filt[3], filt[4]] #filter columns to match
                for i,j in zip(a,b): #loop through two lists to filter
                    if i != j:
                        matches = '\t'.join(data)
                        print (matches)
                        filtered.write(matches + '\n')
filtered.close()
The code executes, but doesn't work as I want. What I get back is the last row of each file, repeated 5 times.
Clearly, I am missing something. I'm not sure if zip is the right function to use, or if something else would be better. I'd appreciate any advice.
Edit:
Sample input for filter:
HSPG2 22161380 22161380 G A
PPTC7 110974744 110974744 G C
OR1S2 57971546 57971546 A C
Sample input for files to filter (extra columns left off):
TKTL1 8277 broad.mit.edu 37 X 153558089 153558089 + 3'UTR SNP G C C
MPP1 4354 broad.mit.edu 37 X 154014502 154014502 + Silent SNP G A A
BRCC3 79184 broad.mit.edu 37 X 154306908 154306908 + Silent SNP A T T
Sample output (extra columns left off):
BRCC3 79184 broad.mit.edu 37 X 154306908 154306908 + Silent SNP A T T
BRCC3 79184 broad.mit.edu 37 X 154306908 154306908 + Silent SNP A T T
BRCC3 79184 broad.mit.edu 37 X 154306908 154306908 + Silent SNP A T T
BRCC3 79184 broad.mit.edu 37 X 154306908 154306908 + Silent SNP A T T
BRCC3 79184 broad.mit.edu 37 X 154306908 154306908 + Silent SNP A T T
Upvotes: 0
Views: 2384
Reputation: 37319
I'm going to start with some simple changes, then show how you can use built-in tools like Python's csv library and the any function to simplify the code.
Here's a version that cleans things up a little, and uses the correct logic, but doesn't introduce too many new language features. The main new things it uses are the with statement (which automatically closes the files when exited) and iterating directly over a file rather than using readlines:
paths = ('filepaths.txt') #file that has filepaths to open
filter_file = ('filter.txt') #file of items to filter

with open(filter_file, 'r') as filter_source:
    filters = []
    for line in filter_source:
        filters.append(line.rstrip().split('\t'))

with open(paths, 'r') as filename_source:
    filenames = []
    for line in filename_source:
        filenames.append(line.rstrip())

with open('filtered.txt','w') as filtered:
    for filename in filenames:
        with open(filename,'r') as datafile:
            for line in datafile:
                data = line.rstrip().split('\t')
                a = [data[0], data[5], data[6], data[10], data[11]] #data columns to match
                for filter in filters:
                    matched = True
                    for i,j in zip(a,filter):
                        if i != j:
                            matched = False
                            break
                    if matched:
                        # the data row matched a filter, stop checking the others
                        break
                if not matched:
                    filtered.write(line)
One thing we do several times there is use a for loop to build up a list. There's a more concise expression that does the same thing, called a list comprehension. So using that, we'd have:
with open(filter_file, 'r') as filter_source:
    filters = [line.rstrip().split('\t') for line in filter_source]

with open(paths, 'r') as filename_source:
    filenames = [line.rstrip() for line in filename_source]
But Python also has a useful csv library that can take care of reading the tab-delimited format:
import csv

with open(filter_file, 'rb') as filter_source:
    filter_reader = csv.reader(filter_source, delimiter='\t')
    filters = list(filter_reader)
When you iterate over it, it returns lists of the fields as separated by the delimiter character. Note that I opened it in b mode; whether that makes a difference or not depends on your platform, but if it does, the csv docs note that it's required.
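As a small standalone illustration of what the reader yields (written in Python 3 syntax, reading from an in-memory buffer; the sample data is made up), each iteration produces one list of strings per row:

```python
import csv
import io

# Hypothetical tab-delimited data, parsed with csv.reader: each
# iteration yields one row as a list of string fields.
text = 'HSPG2\t22161380\tG\nPPTC7\t110974744\tC\n'
reader = csv.reader(io.StringIO(text), delimiter='\t')
rows = list(reader)

print(rows[0])  # ['HSPG2', '22161380', 'G']
```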
You could use it similarly for the data files, and optionally even write your filtered output using the writer class.
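For example, the writer side might look like this minimal sketch (Python 3 style, writing to an in-memory buffer instead of a real file; the sample rows are invented):

```python
import csv
import io

# Hypothetical rows to write back out in tab-delimited form.
rows = [['BRCC3', '79184', 'X'], ['MPP1', '4354', 'X']]

buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t')
writer.writerows(rows)  # one call writes every row

print(buf.getvalue())
```

Note that csv.writer terminates rows with '\r\n' by default; pass lineterminator='\n' if you want plain newlines.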
Finally, the any and all builtins take iterables and return True if any or all of the contents of the iterable evaluate to True. You can use those to drop the nested for loop, using a generator expression - this is a construct similar to a list comprehension, except that it's lazily evaluated, which is nice because any and all will short-circuit. So here's a way to write this:
def match(dataline, filter):
    return all(i == j for (i, j) in zip(dataline, filter))
In this particular case I'm not getting much out of the short-circuiting, because I'm using zip to build an actual list of tuples. But for such short lists it's fine, and zip outperforms itertools.izip (the lazily evaluating version) on lists that are already in memory.
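To see the short-circuiting in action, here's a small standalone demonstration (the record helper and the sample values are made up for illustration):

```python
# Track which pairs actually get compared, to show that all()
# stops consuming the generator at the first falsy result.
checked = []

def record(i, j):
    checked.append((i, j))
    return i == j

a = ['HSPG2', '22161380', 'G']
b = ['HSPG2', '99999999', 'G']

result = all(record(i, j) for (i, j) in zip(a, b))
print(result)        # False
print(len(checked))  # 2 -- the third pair was never compared
```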
Then you can use any to concisely compare the row to all your filters, short-circuiting as soon as one matches:
a = [data[0], data[5], data[6], data[10], data[11]]
if not any(match(a, filter) for filter in filters):
    filtered.write(line)
Except that this is still overkill. The match function is enforcing that all elements in its two inputs must be equal, but if you test whether two lists are equal, that's part of what Python does automatically. The match function as I wrote it will allow lists of unequal length to match as long as the leading elements of the longer list all match the shorter list, while Python list equality does not, but that's not an issue here. So this would also work:
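That difference between the zip-based match and plain list equality can be seen directly (sample values are hypothetical):

```python
# zip() stops at the end of the shorter input, so the zip-based
# match treats a longer filter as equal when its leading elements
# match; list equality compares lengths too.
def match(dataline, filter):
    return all(i == j for (i, j) in zip(dataline, filter))

a = ['HSPG2', '22161380', 'G']
longer = ['HSPG2', '22161380', 'G', 'A']

print(match(a, longer))  # True  -- the extra element is never seen
print(a == longer)       # False -- the lengths differ
```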
a = [data[0], data[5], data[6], data[10], data[11]]
if not any(a == filter for filter in filters):
    filtered.write(line)
Or, if longer-than-normal filters are something you might want to tolerate:

if not any(a == filter[:5] for filter in filters):
The non-slicing version can also be written with a direct list membership test:

if a not in filters:
    filtered.write(line)
Also, as Blckknght points out, Python has a better way to quickly test whether something like a line matches any of a number of patterns - the set datatype, which uses constant-time lookups. Lists, like those returned by the csv library or by split, can't be members of a set - but tuples can, as long as the members of the tuples are themselves hashable. So if you convert your filters and your data line subsets into tuples, you can maintain a set instead of a list and check it even faster. To do that, you have to convert each filter to a tuple:
filters = set(tuple(filter) for filter in filter_reader)
Then, define a as a tuple:

a = (data[0], data[5], data[6], data[10], data[11])
if a not in filters:
    filtered.write(line)
If you're using a csv.writer instance to write the output, you could even consolidate it further using the writerows method and a generator expression:
filtered_writer.writerows(data for data in data_reader if (data[0], data[5], data[6], data[10], data[11]) not in filters)
So wrapping it all up, I would do it like this:

import csv

paths = ('filepaths.txt') #file that has filepaths to open
filter_file = ('filter.txt') #file of items to filter

with open(filter_file, 'rb') as filter_source:
    filters = set(tuple(filter) for filter in csv.reader(filter_source, delimiter='\t'))

with open(paths, 'r') as filename_source:
    filenames = [line.rstrip() for line in filename_source]

with open('filtered.txt','wb') as filtered:
    filtered_writer = csv.writer(filtered, delimiter='\t')
    for filename in filenames:
        with open(filename,'rb') as datafile:
            data_reader = csv.reader(datafile, delimiter='\t')
            filtered_writer.writerows(data for data in data_reader
                                      if (data[0], data[5], data[6], data[10], data[11]) not in filters)
Upvotes: 2
Reputation: 3255
When you create filt, you're creating a single variable and overwriting it on every pass through the loop, so only the last row survives. Try replacing
for f in filtering:
    filt = f.rstrip().split('\t')

with

filt = [f.rstrip().split('\t') for f in filtering]
Now filt is a list of lists, with each element representing one row. So, for example, filt[0] will give you the first row, and filt[2][3] will give you the fourth column of the third row. You may have to modify the rest of your program to work correctly with this.
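Using the sample filter rows from the question, that indexing works out like this (the raw lines are reproduced here as hypothetical input):

```python
# Sample filter lines as they'd come from readlines().
filtering = ['HSPG2\t22161380\t22161380\tG\tA\n',
             'PPTC7\t110974744\t110974744\tG\tC\n',
             'OR1S2\t57971546\t57971546\tA\tC\n']

# Build a list of lists: one inner list of fields per row.
filt = [f.rstrip().split('\t') for f in filtering]

print(filt[0])     # first row as a list of fields
print(filt[2][3])  # fourth column of the third row: 'A'
```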
Upvotes: 0