I am having trouble constructing a pandas DataFrame with a sparse dtype. My input is a collection of feature vectors stored as dicts or Counters. With sparse data such as a bag-of-words representation of text, storing the data as a dense document x term matrix is often infeasible, so the sparsity of the data structure must be preserved.
For example, say the input is:
docs = [{'hello': 1}, {'world': 1, '!': 2}]
Output should be equivalent to:
import pandas as pd
out = pd.DataFrame(docs).astype(pd.SparseDtype(float))
without creating dense arrays along the way. (We can check out.dtypes and out.sparse.density.)
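For reference, here is a small sketch of that check applied to the dense-then-convert baseline above. It assumes a pandas version that provides pd.SparseDtype and the .sparse accessor (0.25 or later), and it only shows what the target result looks like; it does not avoid the dense intermediate.

```python
import pandas as pd

docs = [{'hello': 1}, {'world': 1, '!': 2}]

# Baseline only: this builds a dense frame first, then converts it.
out = pd.DataFrame(docs).astype(pd.SparseDtype(float))

# Every column should report a sparse dtype with NaN as the fill value.
print(out.dtypes)

# Fraction of stored (non-fill) values; 3 of the 6 cells are populated here.
print(out.sparse.density)
```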
Attempt 1:
out = pd.DataFrame(dtype=pd.SparseDtype(float))
out.loc[0, 'hello'] = 1
out.loc[1, 'world'] = 1
out.loc[1, '!'] = 2
But this produces a dense data structure.
Attempt 2:
out = pd.DataFrame({"hello": pd.SparseArray([]),
                    "world": pd.SparseArray([]),
                    "!": pd.SparseArray([])})
out.loc[0, 'hello'] = 1
But this raises TypeError: SparseArray does not support item assignment via setitem.
The solution I eventually found (below) did not work in the earlier versions of pandas that I tried.
This appears to work in Pandas 0.25.1:
out = pd.DataFrame([[0, 'hello', 1], [1, 'world', 1], [1, '!', 2]],
                   columns=['docid', 'word', 'n'])
out = out.set_index(['docid', 'word'])['n'].astype(pd.SparseDtype(float))
out = out.unstack()
Or more generically:
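def dicts_to_sparse_dataframe(docs):
    # One (docid, word, count) triple per entry present in the input dicts.
    rows = ((i, k, v)
            for i, doc in enumerate(docs)
            for k, v in doc.items())
    out = pd.DataFrame(rows, columns=['docid', 'word', 'n'])
    # Convert the long-format counts to a sparse Series, then pivot words into columns.
    out = out.set_index(['docid', 'word'])['n'].astype(pd.SparseDtype(float))
    out = out.unstack()
    return out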
Then:
>>> docs = [{'hello': 1}, {'world': 1, '!': 2}]
>>> df = dicts_to_sparse_dataframe(docs)
>>> df.sparse.density
0.5
I'm hoping this does not create a dense in-memory structure along the way...
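One alternative that may sidestep that worry is to build a scipy.sparse matrix from the dicts and hand it to pd.DataFrame.sparse.from_spmatrix. The following is only a sketch under some assumptions: scipy is installed, the helper name dicts_to_sparse_via_scipy is made up for illustration, and columns are ordered by first appearance. Note that the result uses 0 as the fill value rather than NaN, so it is not strictly identical to the astype(pd.SparseDtype(float)) output.

```python
import pandas as pd
from scipy.sparse import coo_matrix


def dicts_to_sparse_via_scipy(docs):
    """Hypothetical helper: build a sparse DataFrame from an iterable of dicts."""
    vocab = {}  # maps each key to a column number, in first-seen order
    row_idx, col_idx, values = [], [], []
    for i, doc in enumerate(docs):
        for key, value in doc.items():
            row_idx.append(i)
            col_idx.append(vocab.setdefault(key, len(vocab)))
            values.append(float(value))
    # Only the populated cells are stored in coordinate (COO) format.
    matrix = coo_matrix((values, (row_idx, col_idx)),
                        shape=(len(row_idx) and max(row_idx) + 1, len(vocab)))
    # Missing entries are filled with 0 here, not NaN.
    return pd.DataFrame.sparse.from_spmatrix(matrix, columns=list(vocab))


docs = [{'hello': 1}, {'world': 1, '!': 2}]
df = dicts_to_sparse_via_scipy(docs)
print(df.dtypes)
print(df.sparse.density)
```

One caveat of this sketch: the row count is taken from the last document that has any entries, so trailing empty dicts would be dropped. Passing len(docs) explicitly as the row count would avoid that when docs is a list.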