Reputation: 301
I have a dataframe "DF" with 500,000 rows. Here are the data types per column:
ID int64
time datetime64[ns]
data object
Each entry in the "data" column is an array of shape [5, 500].
When I try to save this dataframe using
DF.to_pickle("my_filename.pkl")
I get the following error:
12 """
13 with open(path, 'wb') as f:
---> 14 pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
I also tried this method, but I got the same error:
import pickle
with open('my_filename.pkl', 'wb') as f:
pickle.dump(DF, f)
I tried to save just 10 rows of this dataframe:
DF.head(10).to_pickle('test_save.pkl')
and there was no error at all. So it can save a small DF, but not the large one.
I am using Python 3 and IPython Notebook 3 on a Mac.
Please help me solve this problem. I really need to save this DF to a pickle file, and I cannot find a solution on the internet.
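For reference, a minimal sketch that builds a DataFrame with the same column layout (the names and dtypes come from the description above; the row count and array contents are placeholders, since the real 500,000-row frame with a 5x500 array per row would take several GB of memory):
import numpy as np
import pandas as pd

n_rows = 1000  # the real frame has 500,000 rows

DF = pd.DataFrame({
    'ID': np.arange(n_rows, dtype='int64'),
    'time': pd.date_range('2015-01-01', periods=n_rows, freq='s'),
    # each entry is a 5x500 numpy array, stored as a Python object
    'data': [np.zeros((5, 500)) for _ in range(n_rows)],
})

print(DF.dtypes)  # ID int64, time datetime64[ns], data object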
Upvotes: 18
Views: 17323
Reputation: 27
I ran into this same issue and traced it to memory. According to this resource, it's usually not caused by the memory itself, but by too many resources being moved into swap space. I was able to save the large pandas DataFrame by disabling swap altogether with the command (provided in that link):
swapoff -a
Upvotes: 0
Reputation: 29
Try to use compression. It worked for me.
data_df.to_pickle('data_df.pickle.gz', compression='gzip')
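If it helps, the matching read call would be something like this (the compression argument mirrors the save; recent pandas versions can also infer it from the .gz extension):
import pandas as pd

data_df = pd.read_pickle('data_df.pickle.gz', compression='gzip')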
Upvotes: 2
Reputation: 7554
Until there is a fix somewhere on the pickle/pandas side of things, I'd say a better option is to use an alternative IO backend. HDF is suitable for large datasets (GBs), so you don't need to add extra split/combine logic.
df.to_hdf('my_filename.hdf','mydata',mode='w')
df = pd.read_hdf('my_filename.hdf','mydata')
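Note that to_hdf/read_hdf need the optional PyTables dependency (pip install tables). Also, since the "data" column holds Python objects, pandas may warn that those values get pickled inside the HDF store; that is expected for object columns.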
Upvotes: 17
Reputation: 163
Probably not the answer you were hoping for, but this is what I did...
Split the dataframe into smaller chunks using np.array_split (numpy functions aren't guaranteed to work on DataFrames, but this one does now, although there used to be a bug in it).
Then pickle the smaller dataframes.
When you unpickle them, use pandas.concat (or DataFrame.append) to glue everything back together; see the sketch below.
I agree it is a fudge and suboptimal. If anyone can suggest a "proper" answer I'd be interested to see it, but I think it is as simple as: dataframes are not supposed to get above a certain size.
Split a large pandas dataframe
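A rough sketch of that approach (the chunk count and filenames here are arbitrary, just for illustration):
import numpy as np
import pandas as pd

n_chunks = 20  # pick a number so each piece pickles well under the problematic size

# Pickle the dataframe in smaller pieces
for i, chunk in enumerate(np.array_split(DF, n_chunks)):
    chunk.to_pickle('my_filename_part{}.pkl'.format(i))

# Later: unpickle the pieces and glue them back together
DF = pd.concat(pd.read_pickle('my_filename_part{}.pkl'.format(i))
               for i in range(n_chunks))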
Upvotes: 4