dust

Reputation: 512

Pandas for large(r) datasets

I have a rather complex database that I deliver to my client in CSV format. The logic that produces it is an intricate mix of Python processing and SQL joins done in sqlite3.

There are ~15 source datasets, ranging from a few hundred records to several million (fairly short) records.

For clarity, maintainability, and several other reasons, I would love to replace the mix of Python / sqlite3 logic with an efficient set of pure Python scripts and circumvent sqlite3 altogether.

I understand that Pandas is the natural path here, but could you please advise whether it is the right track for a rather large database like the one described above?
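For context, the kind of SQL join described above maps fairly directly onto a pandas merge. A minimal sketch, with hypothetical file and column names standing in for the actual source datasets:

    import pandas as pd

    # Hypothetical source datasets loaded from CSV.
    orders = pd.read_csv("orders.csv")        # e.g. several million short records
    customers = pd.read_csv("customers.csv")  # e.g. a few hundred records

    # The SQL join
    #   SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id
    # becomes a pandas merge:
    merged = orders.merge(customers, left_on="customer_id", right_on="id", how="inner")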

Upvotes: 2

Views: 208

Answers (1)

Edgar H

Reputation: 1518

I have been using Pandas with datasets > 20 GB in size (on a Mac with 8 GB RAM).

My main problem has been that there is a known bug in Python that makes it impossible to write files larger than 2 GB on OS X. However, writing to HDF5 circumvents that.
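A minimal sketch of writing a DataFrame to an HDF5 store instead of a single large file (the DataFrame here is a made-up stand-in; HDF5 support requires the PyTables package):

    import numpy as np
    import pandas as pd

    # Hypothetical DataFrame standing in for one of the large intermediate results.
    df = pd.DataFrame({"id": np.arange(1_000_000), "value": np.random.rand(1_000_000)})

    # Write to an HDF5 store; format="table" allows appending and querying later.
    df.to_hdf("store.h5", key="results", mode="w", format="table")

    # Read it back.
    df2 = pd.read_hdf("store.h5", key="results")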

I found the tips in this and this article enough to make everything run without problems. The main lesson is to check the memory usage of your DataFrame and cast each column to the smallest data type that can hold it.
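A short sketch of that lesson in practice, assuming a hypothetical source file; the downcasting rules are one common approach, not the only one:

    import pandas as pd

    df = pd.read_csv("big_source.csv")  # hypothetical source file

    # Inspect per-column memory usage (deep=True also counts string contents).
    print(df.memory_usage(deep=True))

    # Downcast numeric columns to the smallest type that can hold their values.
    for col in df.select_dtypes(include="integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include="float").columns:
        df[col] = pd.to_numeric(df[col], downcast="float")

    # String columns with few distinct values are usually much smaller as categoricals.
    for col in df.select_dtypes(include="object").columns:
        if df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype("category")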

Upvotes: 1
