Sandwichnick

Reputation: 1466

Save sparse pandas DataFrame as Parquet file

I want to save a sparse pandas DataFrame as a Parquet file. Unfortunately, it seems that sparse data types are not supported by the underlying pyarrow.

Consider this example code:

from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd


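# Build a mostly-zero array: roughly 90% of the entries are set to 0
arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0

# Wrap it in a sparse DataFrame with named columns
sp_arr = csr_matrix(arr)
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr, columns=['a', 'b', 'c', 'd', 'e'])
sdf.to_parquet('testfile.parquet')  # raises the TypeError shown below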

This results in the following error:

TypeError: Sparse pandas data (column a) not supported.

My real dataset is very large, so I cannot convert the DataFrame to dense. I like the DataFrame format because I can have row and column names, which a NumPy matrix does not store.

Is there an available workaround or any other way to save the dataframe?

Upvotes: 2

Views: 1418

Answers (1)

Rok

Reputation: 416

You could probably create an Arrow ExtensionArray that holds sparse tensors (see the discussion here: https://lists.apache.org/thread/0m2lwnhf9xj57mhjdc9kxn6fhzkppqvo), build a table with that column, and write it to Parquet without going through pandas.
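
If the ExtensionArray route is more than you need, a simpler workaround might be enough. To be clear, this is a different technique from the one suggested above: it stores only the non-zero cells as (row, col, value) triplets in an ordinary dense table, plus the row and column labels, so nothing sparse ever reaches pyarrow. The function names and the three-file layout below are made up for illustration, and the sketch assumes every column is sparse with a fill value of 0 (which is what `from_spmatrix` produces).

```python
import pandas as pd
from scipy.sparse import coo_matrix


def to_triplets(sdf):
    """Split a sparse DataFrame into (row, col, value) triplets plus its labels."""
    # to_coo() keeps the data sparse; it requires all-sparse columns with fill_value 0
    coo = sdf.sparse.to_coo()
    triplets = pd.DataFrame({"row": coo.row, "col": coo.col, "value": coo.data})
    return triplets, sdf.index, sdf.columns


def from_triplets(triplets, index, columns):
    """Rebuild the sparse DataFrame from triplets and labels."""
    mat = coo_matrix(
        (
            triplets["value"].to_numpy(),
            (triplets["row"].to_numpy(), triplets["col"].to_numpy()),
        ),
        shape=(len(index), len(columns)),
    )
    return pd.DataFrame.sparse.from_spmatrix(mat, index=index, columns=columns)


def save_sparse(sdf, prefix):
    """Write three ordinary Parquet files (hypothetical naming scheme)."""
    triplets, index, columns = to_triplets(sdf)
    triplets.to_parquet(f"{prefix}.values.parquet")
    pd.DataFrame({"label": index}).to_parquet(f"{prefix}.index.parquet")
    pd.DataFrame({"label": columns}).to_parquet(f"{prefix}.columns.parquet")


def load_sparse(prefix):
    """Read the three files back and reassemble the sparse DataFrame."""
    triplets = pd.read_parquet(f"{prefix}.values.parquet")
    index = pd.read_parquet(f"{prefix}.index.parquet")["label"]
    columns = pd.read_parquet(f"{prefix}.columns.parquet")["label"]
    return from_triplets(triplets, pd.Index(index), pd.Index(columns))
```

With the question's data this would be `save_sparse(sdf, 'testfile')` followed later by `sdf2 = load_sparse('testfile')`. The on-disk size scales with the number of non-zero cells rather than with rows times columns, at the cost of two extra integer columns per stored value.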

Upvotes: 1
