aerin

Reputation: 22684

Pickle dump Pandas DataFrame

This is a question from a lazy man.

I have a pandas DataFrame with 4 million rows and would like to save it as smaller chunks of pickle files.

Why smaller chunks? To save and load them more quickly.

My questions are: 1) Is there a better way (a built-in function) to save it in smaller pieces than manually chunking it with np.array_split (a sketch of what I mean follows below)?

2) Is there any graceful way of gluing the chunks back together when I read them, other than manually concatenating them?

Please feel free to suggest any data format other than pickle that is suited for this job.
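
For reference, a minimal sketch of the manual approach I mean (the chunk count and file names are just placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4_000_000, 5))

# Manual chunking: split into 10 roughly equal pieces and pickle each one.
for i, chunk in enumerate(np.array_split(df, 10)):
    chunk.to_pickle(f'chunk_{i}.pkl')

# Manual gluing: read every chunk back and concatenate into one frame.
df_restored = pd.concat(pd.read_pickle(f'chunk_{i}.pkl') for i in range(10))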

Upvotes: 3

Views: 2721

Answers (2)

piRSquared

Reputation: 294488

I've been using this for a DataFrame of size 7,000,000 x 250.

Use HDF5 (see the pandas HDF5 documentation):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 5))
df


# Write the frame to a compressed HDF5 store under the key 'this_df'.
df.to_hdf('myrandomstore.h5', 'this_df', append=False, complib='blosc', complevel=9)

# Read it back in one call.
new_df = pd.read_hdf('myrandomstore.h5', 'this_df')
new_df

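If you want to write in pieces instead of all at once, you can append each chunk to the same store; a sketch, assuming df_chunks is some iterable of DataFrames (appending requires format='table'):

# Append every chunk to one table-format store ('df_chunks' is hypothetical).
for chunk in df_chunks:
    chunk.to_hdf('myrandomstore.h5', key='this_df', append=True,
                 format='table', complib='blosc', complevel=9)

# The chunks come back glued together as a single DataFrame.
new_df = pd.read_hdf('myrandomstore.h5', 'this_df')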

Upvotes: 3

kpie

Reputation: 11110

If the goal is to save and load quickly, you should look into using SQL rather than pickling. If your computer chokes when you ask it to write 4 million rows at once, you can specify a chunksize.

From there you can query slices with standard SQL.
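
A minimal sketch with the built-in sqlite3 driver, assuming df is the 4-million-row frame from the question (the file name mydata.db and table name frame are placeholders):

import sqlite3
import pandas as pd

con = sqlite3.connect('mydata.db')

# chunksize writes the rows in batches of 100,000 instead of all at once.
df.to_sql('frame', con, if_exists='replace', index=False, chunksize=100_000)

# Query back just a slice with standard SQL.
part = pd.read_sql('SELECT * FROM frame LIMIT 100000', con)
con.close()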

Upvotes: 4
