SantoshGupta7

Reputation: 6197

Fastest/most efficient data storage format for text and floating point numbers

I have heard that CSV is best for text data and NumPy is best for numerical/floating-point data, but my pandas DataFrame has both text and floating-point numbers.

I am looking at all the data storage formats available in Pandas.

Format type   Data description    Reader           Writer
text          CSV                 read_csv         to_csv
text          JSON                read_json        to_json
text          HTML                read_html        to_html
text          Local clipboard     read_clipboard   to_clipboard
binary        MS Excel            read_excel       to_excel
binary        HDF5 Format         read_hdf         to_hdf
binary        Feather Format      read_feather     to_feather
binary        Parquet Format      read_parquet     to_parquet
binary        Msgpack             read_msgpack     to_msgpack
binary        Stata               read_stata       to_stata
binary        SAS                 read_sas
SQL           SQL                 read_sql         to_sql
SQL           Google Big Query    read_gbq         to_gbq

What is the best option for float/text data?

Best in terms of: the smallest amount of memory/disk space used.

Best in terms of: fastest save/load times.
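For reference, a minimal harness I would use to compare a few of these formats on a mixed text/float frame (the column names, row count, and file names are arbitrary, and the Parquet call assumes pyarrow or fastparquet is installed):

    import os
    import time

    import numpy as np
    import pandas as pd

    # Toy mixed-type frame: one text column, one float column (placeholder data)
    n = 1_000_000
    df = pd.DataFrame({
        "label": np.random.choice(["alpha", "beta", "gamma"], size=n),
        "value": np.random.rand(n),
    })

    def benchmark(name, write, read, path):
        t0 = time.perf_counter()
        write(path)                       # time the save
        t_write = time.perf_counter() - t0
        t0 = time.perf_counter()
        read(path)                        # time the load
        t_read = time.perf_counter() - t0
        size_mb = os.path.getsize(path) / 1e6
        print(f"{name:8s} write={t_write:.2f}s read={t_read:.2f}s size={size_mb:.1f} MB")

    benchmark("csv", df.to_csv, pd.read_csv, "df.csv")
    benchmark("pickle", df.to_pickle, pd.read_pickle, "df.pkl")
    benchmark("parquet", df.to_parquet, pd.read_parquet, "df.parquet")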

Upvotes: 2

Views: 1611

Answers (1)

Rafal Janik

Reputation: 309

You will be happiest with Parquet.

  • It is well supported not only in Python but in most languages.
  • It works great on small data and scales nicely to huge datasets.
  • It is relatively quick for writing and loading data.
  • Handles sparse datasets.
  • It supports compression (gzip and others).
  • Looks good on the old resume.

But most of all, it's easy to work with, and you can learn the finer points as you go.
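As a minimal sketch (the column names and the gzip choice are just placeholders, and to_parquet assumes pyarrow or fastparquet is installed):

    import pandas as pd

    # Mixed text/float DataFrame (placeholder data)
    df = pd.DataFrame({
        "name": ["a", "b", "c"],
        "score": [0.1, 0.2, 0.3],
    })

    # Write with gzip compression (snappy is the default)
    df.to_parquet("data.parquet", compression="gzip")

    # Read it back; dtypes (object for text, float64 for floats) round-trip
    df2 = pd.read_parquet("data.parquet")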

Edit: adding a blog post on the topic with some benchmarks: https://tech.jda.com/efficient-dataframe-storage-with-apache-parquet/#

Upvotes: 3
