Reputation: 310
I've got a large (145 MB) CSV file I would like to work with in Python. I'm new to Python, and am trying to wrap my head around the data that csv.reader() gives me in the form of an iterator. I've searched and searched and found a ton of information on what iterators are and how they work, but very little on how to actually use them when processing data.
I understand the next() method and the whole stop iteration thing, but this just seems like an extremely awkward way to store and retrieve data. Short of running through every row in the iterator in a for loop and appending it to a list (which seems prohibitively cumbersome), I don't really know how to get the data I need out of the iterator, especially considering my data is sorted by column, not row. What is the intended way to use the csv.reader() function, and is there a better way to read the contents of my csv file?
Every time I need a specific data set, am I expected to iterate through and rebuild the iterator tens of thousands of times to get the full column of data I need? I guess I haven't tried that, but it just doesn't seem right... I must be missing something.
Upvotes: 3
Views: 5419
Reputation: 454
You can iterate over the columns by transposing the rows with itertools:
import csv
from itertools import izip  # Python 2 only

# transpose: unpack every row into izip, which regroups the values by column
infile = csv.reader(open('t.txt'))
transposed = izip(*infile)
for c in transposed:
    print c
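If you're on Python 3, itertools.izip no longer exists, but the built-in zip does the same thing. A minimal sketch of the same transpose idea, assuming the same placeholder filename:

import csv

with open('t.txt', newline='') as f:
    reader = csv.reader(f)
    # zip(*reader) unpacks every row and regroups the values into columns
    for column in zip(*reader):
        print(column)

Note that the * unpacking has to read the whole file before the first column comes out, so this approach does hold everything in memory at once.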
Upvotes: 1
Reputation: 31339
An iterator is simply a way to iterate over data without holding it all in memory. A file can technically be bigger than your available memory, and even your swap, which would make loading it all at once a real headache.
An iterator only promises that it knows how to get the next value. That abstraction lets it forget everything it has already handed out and not yet hold everything it is going to hand out, so its memory footprint can be as small as a single item. When you're iterating over a huge file, that is a big relief.
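As a minimal sketch of what that means in practice (the filename is just a placeholder), csv.reader hands you one row at a time, and only the current row has to live in memory:

import csv

with open('data.csv', newline='') as f:   # Python 3; use open('data.csv', 'rb') on Python 2
    reader = csv.reader(f)
    header = next(reader)      # the first row, fetched on demand
    for row in reader:         # each row is a list of strings
        ...                    # work with this one row, then let it go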
That said, if you want several different datasets, you may want to build them all in a single pass over the file and then use them. That also lets you filter out data you are not going to use.
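A rough sketch of that idea (the filename and column indices are made up for illustration): one pass over the file, collecting every column you care about as you go.

import csv
from collections import defaultdict

wanted = (0, 3, 7)                  # hypothetical column indices you need
columns = defaultdict(list)         # column index -> list of values

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)                    # skip the header row, if there is one
    for row in reader:              # a single pass over the whole file
        for i in wanted:
            columns[i].append(row[i])

# columns[0], columns[3] and columns[7] now hold the full columns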
You can also do processing during the iteration.
You always have the option of holding the entire file in memory as a list, but that's usually not what you want.
Here is a rough example of using an iterator for processing:
rows = []
# ... create an iterator (e.g. csv.reader)
for row in iterator:
    rows.append(process(row))  # collect each processed row
# ... use rows
You can also use an iterator to filter the rows you're interested in:
# define an is_needed(row) predicate for a row
needed_rows = filter(is_needed, iterator)
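For example, with a made-up predicate (the column index and rule are purely illustrative):

def is_needed(row):
    # hypothetical rule: keep rows whose second field is not empty
    return len(row) > 1 and row[1] != ""

needed_rows = filter(is_needed, iterator)   # lazy on Python 3, a list on Python 2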
Here is an example of storing the values in memory:
# ... create iterator
rows = list(iterator)
# ... use rows - contains all values
Upvotes: 1