SantoshGupta7

Reputation: 6197

Fastest/most efficient data storage format for text and floating point numbers

I have heard that CSV is best for text data and NumPy is best for numerical/floating-point data, but my pandas DataFrame has both text and floating-point numbers.

I am looking at all the data storage formats available in Pandas.

Format type   Data description    Reader           Writer
text          CSV                 read_csv         to_csv
text          JSON                read_json        to_json
text          HTML                read_html        to_html
text          Local clipboard     read_clipboard   to_clipboard
binary        MS Excel            read_excel       to_excel
binary        HDF5 Format         read_hdf         to_hdf
binary        Feather Format      read_feather     to_feather
binary        Parquet Format      read_parquet     to_parquet
binary        Msgpack             read_msgpack     to_msgpack
binary        Stata               read_stata       to_stata
binary        SAS                 read_sas
SQL           SQL                 read_sql         to_sql
SQL           Google Big Query    read_gbq         to_gbq

What is the best option for float/text data?

Best in terms of: the smallest amount of memory/disk space used.

Best in terms of: fastest save/load times.
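For reference, a minimal harness I would use to compare a few of these formats on a mixed text/float frame (the column names, row count, and file names are arbitrary, and the Parquet call assumes pyarrow or fastparquet is installed):

    import os
    import time

    import numpy as np
    import pandas as pd

    # Toy mixed-type frame: one text column, one float column (placeholder data)
    n = 1_000_000
    df = pd.DataFrame({
        "label": np.random.choice(["alpha", "beta", "gamma"], size=n),
        "value": np.random.rand(n),
    })

    def benchmark(name, write, read, path):
        t0 = time.perf_counter()
        write(path)                       # time the save
        t_write = time.perf_counter() - t0
        t0 = time.perf_counter()
        read(path)                        # time the load
        t_read = time.perf_counter() - t0
        size_mb = os.path.getsize(path) / 1e6
        print(f"{name:8s} write={t_write:.2f}s read={t_read:.2f}s size={size_mb:.1f} MB")

    benchmark("csv", df.to_csv, pd.read_csv, "df.csv")
    benchmark("pickle", df.to_pickle, pd.read_pickle, "df.pkl")
    benchmark("parquet", df.to_parquet, pd.read_parquet, "df.parquet")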

Upvotes: 2

Views: 1611

Answers (1)

Rafal Janik

Reputation: 309

You will be happiest with Parquet.

  • It is well supported not only in Python but in most languages.
  • It works great on small data and scales nicely to huge datasets.
  • It is relatively quick for writing and loading data.
  • Handles sparse datasets.
  • It supports compression (gzip and others).
  • Looks good on the old resume.

But most of all, it's easy to work with, and you can learn the finer points as you go.
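As a minimal sketch (the column names and the gzip choice are just placeholders, and to_parquet assumes pyarrow or fastparquet is installed):

    import pandas as pd

    # Mixed text/float DataFrame (placeholder data)
    df = pd.DataFrame({
        "name": ["a", "b", "c"],
        "score": [0.1, 0.2, 0.3],
    })

    # Write with gzip compression (snappy is the default)
    df.to_parquet("data.parquet", compression="gzip")

    # Read it back; dtypes (object for text, float64 for floats) round-trip
    df2 = pd.read_parquet("data.parquet")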

Edit: adding a blog post on the topic with some benchmarks: https://tech.jda.com/efficient-dataframe-storage-with-apache-parquet/#

Upvotes: 3
