Reputation: 6197
I have heard CSV is best for text data, and numpy is best for numerical/floating point data. But my pandas dataframe has both text and floating point numbers.
I am looking at all the data storage formats available in Pandas.
Format type   Data description         Reader           Writer
text          CSV                      read_csv         to_csv
text          JSON                     read_json        to_json
text          HTML                     read_html        to_html
text          Local clipboard          read_clipboard   to_clipboard
binary        MS Excel                 read_excel       to_excel
binary        HDF5 Format              read_hdf         to_hdf
binary        Feather Format           read_feather     to_feather
binary        Parquet Format           read_parquet     to_parquet
binary        Msgpack                  read_msgpack     to_msgpack
binary        Stata                    read_stata       to_stata
binary        SAS                      read_sas
binary        Python Pickle Format     read_pickle      to_pickle
SQL           SQL                      read_sql         to_sql
SQL           Google Big Query         read_gbq         to_gbq
So what is the best option for mixed float/text data?
Best in terms of: smallest memory footprint.
Best in terms of: fastest save/load times.
Upvotes: 2
Views: 1611
Reputation: 309
You will be happiest with Parquet.
Most of all, it's easy to work with, and you can learn its finer points as you go.
Edit: here is a blog post on the topic with some benchmarks: https://tech.jda.com/efficient-dataframe-storage-with-apache-parquet/#
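To make the Parquet suggestion concrete, here is a minimal sketch of writing a mixed text/float dataframe to both CSV and Parquet and reading the Parquet file back. The helper name `compare_csv_and_parquet`, the file names, and the sample data are made up for illustration, and it assumes a Parquet engine (pyarrow or fastparquet) is installed alongside pandas. The relative file sizes will depend on your data, so treat this as a way to measure rather than a benchmark result.

```python
import os
import tempfile

import pandas as pd


def compare_csv_and_parquet(df: pd.DataFrame, directory: str) -> dict:
    """Write df as CSV and as Parquet, then read the Parquet file back.

    Returns both file sizes in bytes and the dataframe restored from Parquet.
    """
    csv_path = os.path.join(directory, "example.csv")
    parquet_path = os.path.join(directory, "example.parquet")

    df.to_csv(csv_path, index=False)
    # to_parquet needs pyarrow or fastparquet to be installed.
    df.to_parquet(parquet_path, index=False)

    restored = pd.read_parquet(parquet_path)
    return {
        "csv_bytes": os.path.getsize(csv_path),
        "parquet_bytes": os.path.getsize(parquet_path),
        "restored": restored,
    }


# Hypothetical mixed text/float data, invented for this example.
df = pd.DataFrame(
    {
        "name": ["alpha", "beta", "gamma"] * 1000,
        "value": [0.1, 2.5, 3.75] * 1000,
    }
)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        result = compare_csv_and_parquet(df, tmp)
        print("CSV bytes:    ", result["csv_bytes"])
        print("Parquet bytes:", result["parquet_bytes"])
        print(result["restored"].dtypes)
```

Unlike CSV, the Parquet file stores the column types, so the float column comes back as floats without any parsing options. To compare save/load times on your own data, you could wrap the `to_*` and `read_*` calls with `time.perf_counter()`.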
Upvotes: 3