Reputation: 17617
I have a large csv file (5 GB) and I can read it with pandas.read_csv()
. This operation takes a lot of time 10-20 minutes.
How can I speed it up?
Would it be useful to transform the data in a sqllite
format? In case what should I do?
EDIT: More information: The data contains 1852 columns and 350000 rows. Most of the columns are float65 and contain numbers. Some other contains string or dates (that I suppose are considered as string)
I am using a laptop with 16 GB of RAM and SSD hard drive. The data should fit fine in memory (but I know that python tends to increase the data size)
EDIT 2 :
During the loading I receive this message
/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py:1164: DtypeWarning: Columns (1841,1842,1844) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
EDIT: SOLUTION
Read one time the csv file and save it as
data.to_hdf('data.h5', 'table')
This format is incredibly efficient
Upvotes: 3
Views: 3653
Reputation: 570
If the problem is in the processing overhead, then you can divide the file into smaller files and handle them in different CPU cores or threads. Also for some algorithms the python time will increase non-linearly and the dividing method will help in these cases.
Upvotes: 0
Reputation: 365845
This actually depends on which part of reading it is taking 10 minutes.
It's important to know which one you have. For example, stream-compressing the data (e.g., with gzip
) will make the first problem a lot better, but the second one even worse.
It sounds like you probably have the second problem, which is good to know. (However, there are things you can do that will probably be better no matter what the problem.)
Your idea of storing it in a sqlite database is nice because it can at least potentially solve all three at once; you only read the data in from disk as-needed, and it's stored in a reasonably compact and easy-to-process form. But it's not the best possible solution for the first two, just a "pretty good" one.
In particular, if you actually do need to do array-wide work across all 350000 rows, and can't translate that work into SQL queries, you're not going to get much benefit out of sqlite. Ultimately, you're going to be doing a giant SELECT
to pull in all the data and then process it all into one big frame.
Writing out the shape and structure information, then writing the underlying arrays in NumPy binary form. Then, for reading, you have to reverse that. NumPy's binary form just stores the raw data as compactly as possible, and it's a format that can be written blindingly quickly (it's basically just dumping the raw in-memory storage to disk). That will improve both the first and second problems.
Similarly, storing the data in HDF5 (either using Pandas IO or an external library like PyTables or h5py) will improve both the first and second problems. HDF5 is designed to be a reasonably compact and simple format for storing the same kind of data you usually store in Pandas. (And it includes optional compression as a built-in feature, so if you know which of the two you have, you can tune it.) It won't solve the second problem quite as well as the last option, but probably well enough, and it's much simpler (once you get past setting up your HDF5 libraries).
Finally, pickling the data may sometimes be faster. pickle
is Python's native serialization format, and it's hookable by third-party modules—and NumPy and Pandas have both hooked it to do a reasonably good job of pickling their data.
(Although this doesn't apply to the question, it may help someone searching later: If you're using Python 2.x, make sure to explicitly use pickle format 2; IIRC, NumPy is very bad at the default pickle format 0. In Python 3.0+, this isn't relevant, because the default format is at least 3.)
Upvotes: 4
Reputation: 619
Python has two built-in libraries called pickle
and cPickle
that can store any Python data structure.
cPickle
is identical to pickle
except that cPickle
has trouble with Unicode stuff and is 1000x faster.
Both are really convenient for saving stuff that's going to be re-loaded into Python in some form, since you don't have to worry about some kind of error popping up in your file I/O.
Having worked with a number of XML files, I've found some performance gains from loading pickles instead of raw XML. I'm not entirely sure how the performance compares with CSVs, but it's worth a shot, especially if you don't have to worry about Unicode stuff and can use cPickle
. It's also simple, so if it's not a good enough boost, you can move on to other methods with minimal time lost.
A simple example of usage:
>>> import pickle
>>> stuff = ["Here's", "a", "list", "of", "tokens"]
>>> fstream = open("test.pkl", "wb")
>>> pickle.dump(stuff,fstream)
>>> fstream.close()
>>>
>>> fstream2 = open("test.pkl", "rb")
>>> old_stuff = pickle.load(fstream2)
>>> fstream2.close()
>>> old_stuff
["Here's", 'a', 'list', 'of', 'tokens']
>>>
Notice the "b" in the file stream openers. This is important--it preserves cross-platform compatibility of the pickles. I've failed to do this before and had it come back to haunt me.
For your stuff, I recommend writing a first script that parses the CSV and saves it as a pickle; when you do your analysis, the script associated with that loads the pickle like in the second block of code up there.
I've tried this with XML; I'm curious how much of a boost you will get with CSVs.
Upvotes: 1