Converting a list of Counters to sparse Pandas DataFrame

Question

I am having trouble constructing a pandas DataFrame with a sparse dtype. My input is a collection of feature vectors stored as dicts or Counters. With sparse data such as a bag-of-words representation of text, storing the data as a dense document x term matrix is often infeasible, so the sparsity of the data structure has to be maintained.

For example, say the input is:

docs = [{'hello': 1}, {'world': 1, '!': 2}]

Output should be equivalent to:

import pandas as pd
out = pd.DataFrame(docs).astype(pd.SparseDtype(float))

without creating dense arrays along the way. (We can check out.dtypes and out.sparse.density.)

Attempt 1:

out = pd.DataFrame(dtype=pd.SparseDtype(float))
out.loc[0, 'hello'] = 1
out.loc[1, 'world'] = 1
out.loc[1, '!'] = 2

But this produces a dense data structure.

Attempt 2:

out = pd.DataFrame({"hello": pd.arrays.SparseArray([]),
                    "world": pd.arrays.SparseArray([]),
                    "!": pd.arrays.SparseArray([])})
out.loc[0, 'hello'] = 1

But this raises TypeError: SparseArray does not support item assignment via setitem.

The solution I eventually found (below) did not work in the earlier versions of pandas that I tried.
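
For illustration only, and not necessarily the solution mentioned above: one way to avoid a dense intermediate is to collect (row, column, value) triplets from the dicts, wrap them in a SciPy COO matrix, and hand that to pd.DataFrame.sparse.from_spmatrix. The helper names to_coo_triplets and to_sparse_frame are hypothetical, SciPy is assumed to be installed, and a reasonably recent pandas is assumed. Note one difference from pd.DataFrame(docs).astype(pd.SparseDtype(float)): in this sketch missing entries are filled with 0 rather than NaN.

```python
from collections import Counter


def to_coo_triplets(docs):
    """Turn a list of dicts/Counters into COO triplets plus a column vocabulary.

    Hypothetical helper: columns are numbered in first-seen order.
    """
    vocab = {}  # term -> column index
    rows, cols, data = [], [], []
    for i, doc in enumerate(docs):
        for term, value in doc.items():
            j = vocab.setdefault(term, len(vocab))
            rows.append(i)
            cols.append(j)
            data.append(float(value))
    return rows, cols, data, list(vocab)


def to_sparse_frame(docs):
    """Build a sparse DataFrame from the triplets (assumes pandas and scipy)."""
    import pandas as pd
    from scipy.sparse import coo_matrix

    rows, cols, data, columns = to_coo_triplets(docs)
    mat = coo_matrix((data, (rows, cols)), shape=(len(docs), len(columns)))
    # Unstored entries are treated as 0 here, not NaN as in pd.DataFrame(docs).
    return pd.DataFrame.sparse.from_spmatrix(mat, columns=columns)


docs = [{'hello': 1}, Counter({'world': 1, '!': 2})]
rows, cols, data, columns = to_coo_triplets(docs)
```

Calling to_sparse_frame(docs) should then give a frame whose columns all have a sparse dtype, which can be checked with out.dtypes and out.sparse.density as described earlier. Only the non-zero values and their positions are stored at every step.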
