Reputation: 1466
I want to save a sparse pandas DataFrame as a parquet file. Unfortunately, it seems that sparse datatypes are not supported by the underlying pyarrow.
Consider this example code:
from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd
arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr, columns=['a', 'b', 'c', 'd', 'e'])
sdf.to_parquet('testfile.parquet')
This results in the following error:
TypeError: Sparse pandas data (column a) not supported.
My real dataset is very large, so I cannot densify the dataframe. I like the DataFrame format because I can keep row and column names, which a plain NumPy array does not store.
Is there an available workaround or any other way to save the dataframe?
Upvotes: 2
Views: 1418
Reputation: 416
You could probably create an ExtensionArray containing sparse tensors (see the discussion here: https://lists.apache.org/thread/0m2lwnhf9xj57mhjdc9kxn6fhzkppqvo), build a table with this column, and store it to parquet without touching pandas.
Upvotes: 1