Reputation: 10689
I have a function I am using to read in files of a particular format. My function looks like this:
import csv
from collections import namedtuple

def read_file(f, name, header=True):
    with open(f, mode="r") as infile:
        reader = csv.reader(infile, delimiter="\t")
        if header is True:
            next(reader)
        gene_data = namedtuple("Data", 'id, name, q, start, end, sym')
        for row in reader:
            row = gene_data(*row)
            yield row
I also have another type of file that I would like to read in with this function. However, the other file type needs a few slight parsing steps before I can use the read_file function. For example, trailing periods need to be stripped from column q, and the characters atr need to be appended to the id column. Obviously, I could create a new function, or add some optional arguments to the existing function, but is there a simple way to modify this function so that it can be used to read in additional file type(s)? I was thinking of something along the lines of a decorator?
Upvotes: 2
Views: 191
Reputation: 2535
In the spirit of Niklas B.'s answer:
import csv, functools
from collections import namedtuple

def consumer(func):
    @functools.wraps(func)
    def start(*args, **kwargs):
        g = func(*args, **kwargs)
        next(g)  # prime the coroutine so it can accept .send()
        return g
    return start

def csv_rows(infile, header, dest):
    reader = csv.reader(infile, delimiter='\t')
    if header: next(reader)
    for line in reader:
        dest.send(line)

@consumer
def data_sets(dest):
    gene_data = namedtuple("Data", 'id, name, q, start, end, sym')
    while 1:
        row = (yield)
        dest.send(gene_data(*row))

def read_file_1(fn, header=True):
    results, sink = getsink()
    with open(fn) as infile:
        csv_rows(infile, header, data_sets(sink))
    return results

def getsink():
    r = []
    @consumer
    def _sink():
        while 1:
            x = (yield)
            r.append(x)
    return (r, _sink())

@consumer
def transform_data_sets(dest):
    while True:
        data = (yield)
        dest.send(data[::-1])  # or whatever

def read_file_2(fn, header=True):
    results, sink = getsink()
    with open(fn) as infile:
        csv_rows(infile, header, data_sets(transform_data_sets(sink)))
    return results
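To see the coroutine plumbing in isolation, the same `consumer`/sink pattern can be driven from an in-memory list instead of a file (a sketch, not part of the original answer; the sample row is made up):

```python
import functools
from collections import namedtuple

def consumer(func):
    @functools.wraps(func)
    def start(*args, **kwargs):
        g = func(*args, **kwargs)
        next(g)  # advance to the first (yield) so .send() works
        return g
    return start

GeneData = namedtuple("Data", 'id, name, q, start, end, sym')

def getsink():
    r = []
    @consumer
    def _sink():
        while True:
            r.append((yield))
    return r, _sink()

@consumer
def to_gene_data(dest):
    while True:
        dest.send(GeneData(*(yield)))

results, sink = getsink()
pipeline = to_gene_data(sink)
for raw in [['g1', 'alpha', 'x.', '1', '10', 'A']]:
    pipeline.send(raw)
print(results[0].name)  # -> alpha
```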
Upvotes: 0
Reputation: 45542
The object-oriented way would be this:
import csv
from collections import namedtuple

class GeneDataReader:
    _GeneData = namedtuple('GeneData', 'id, name, q, start, end, sym')
    def __init__(self, filename, has_header=True):
        self._ignore_1st_row = has_header
        self._filename = filename
    def __iter__(self):
        for row in self._tsv_by_row():
            yield self._GeneData(*self.preprocess_row(row))
    def _tsv_by_row(self):
        with open(self._filename, 'r') as f:
            reader = csv.reader(f, delimiter='\t')
            if self._ignore_1st_row:
                next(reader)
            for row in reader:
                yield row
    def preprocess_row(self, row):
        # does nothing. override in derived classes
        return row

class SpecializedGeneDataReader(GeneDataReader):
    def preprocess_row(self, row):
        row[0] += 'atr'
        row[2] = row[2].rstrip('.')
        return row
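A brief usage sketch (not from the original answer; the classes are repeated here so the snippet is self-contained, and the sample file contents are made up):

```python
import csv
import os
import tempfile
from collections import namedtuple

class GeneDataReader:
    _GeneData = namedtuple('GeneData', 'id, name, q, start, end, sym')
    def __init__(self, filename, has_header=True):
        self._ignore_1st_row = has_header
        self._filename = filename
    def __iter__(self):
        for row in self._tsv_by_row():
            yield self._GeneData(*self.preprocess_row(row))
    def _tsv_by_row(self):
        with open(self._filename) as f:
            reader = csv.reader(f, delimiter='\t')
            if self._ignore_1st_row:
                next(reader)
            for row in reader:
                yield row
    def preprocess_row(self, row):
        return row

class SpecializedGeneDataReader(GeneDataReader):
    def preprocess_row(self, row):
        row[0] += 'atr'
        row[2] = row[2].rstrip('.')
        return row

# write a throwaway special-format file and read it back
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False) as tmp:
    tmp.write('id\tname\tq\tstart\tend\tsym\n')
    tmp.write('g1\talpha\tval..\t1\t10\tA\n')

records = list(SpecializedGeneDataReader(tmp.name))
os.unlink(tmp.name)
print(records[0].id, records[0].q)  # -> g1atr val
```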
The simplest way would be to modify your currently working code with an extra argument.
def read_file(name, is_special=False, has_header=True):
    with open(name, 'r') as infile:
        reader = csv.reader(infile, delimiter='\t')
        if has_header:
            next(reader)
        Data = namedtuple("Data", 'id, name, q, start, end, sym')
        for row in reader:
            if is_special:
                row[0] += 'atr'
                row[2] = row[2].rstrip('.')
            row = Data(*row)
            yield row
If you are looking for something less nested but still procedure based:
GeneData = namedtuple("GeneData", 'id, name, q, start, end, sym')

def tsv_by_row(name, has_header=True):
    with open(name, 'r') as infile:
        reader = csv.reader(infile, delimiter='\t')
        if has_header: next(reader)
        for row in reader:
            yield row

def gene_data_from_vanilla_file(name, has_header=True):
    for row in tsv_by_row(name, has_header):
        yield GeneData(*row)

def gene_data_from_special_file(name, has_header=True):
    for row in tsv_by_row(name, has_header):
        row[0] += 'atr'
        row[2] = row[2].rstrip('.')
        yield GeneData(*row)
Upvotes: 1
Reputation: 95308
Having such a monolithic function that takes a filename instead of an open file is by itself not very Pythonic. You are trying to implement a stream processor here (file stream -> line stream -> CSV record stream -> [transformer ->] data stream), so using a generator is actually a good idea. I'd slightly refactor this to be a bit more modular:
import csv
from collections import namedtuple

def csv_rows(infile, header):
    reader = csv.reader(infile, delimiter="\t")
    if header: next(reader)
    return reader

def data_sets(infile, header):
    gene_data = namedtuple("Data", 'id, name, q, start, end, sym')
    for row in csv_rows(infile, header):
        yield gene_data(*row)

def read_file_type1(infile, header=True):
    # for this file type, we only need to pass the caller the raw
    # data objects
    return data_sets(infile, header)

def read_file_type2(infile, header=True):
    # for this file type, we have to pre-process the data sets
    # before yielding them. A good way to express this is using a
    # generator expression (we could also add a filtering condition here)
    return (transform_data_set(x) for x in data_sets(infile, header))

# Usage sample:
with open("...", "r") as f:
    for obj in read_file_type1(f):
        print(obj)
As you can see, we have to pass the header
argument all the way through the function chain. This is a strong hint that an object-oriented approach would be appropriate here. The fact that we obviously face a hierarchical type structure here (basic data file, type1, type2) supports this.
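The `transform_data_set` helper is left undefined above; for the special file described in the question, a sketch might look like this (assuming the same six-field namedtuple):

```python
from collections import namedtuple

GeneData = namedtuple("Data", 'id, name, q, start, end, sym')

def transform_data_set(record):
    # append 'atr' to id and strip trailing periods from q, as the
    # question requires; namedtuples are immutable, so use _replace
    return record._replace(id=record.id + 'atr', q=record.q.rstrip('.'))

r = transform_data_set(GeneData('g1', 'alpha', 'val..', '1', '10', 'A'))
print(r.id, r.q)  # -> g1atr val
```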
Upvotes: 3
Reputation: 10493
I suggest you create a row iterator that can be used like the following:
with MyFile('f') as f:
    for entry in f:
        foo(entry)
You can do this by implementing a class for your own files that supports the context-manager and iterator protocols.
Alongside it, you could create a function open_my_file(filename) that determines the file type and returns the appropriate file object to work with. This might be a slightly enterprise-y approach, but it's worth implementing if you're dealing with multiple file types.
Upvotes: 1
Reputation: 77271
IMHO, the most Pythonic way would be to convert the function into a base class, split the file operations into methods, and override those methods in new classes derived from your base class.
Upvotes: 4