joeln

Reputation: 3643

Converting a list of Counters to sparse Pandas DataFrame

I am having trouble constructing a pandas DataFrame with a sparse dtype. My input is a collection of feature vectors stored as dicts or Counters. With sparse data such as a bag-of-words representation of text, storing a dense document x term matrix is often infeasible, so the data structure has to stay sparse throughout.

For example, say the input is:

docs = [{'hello': 1}, {'world': 1, '!': 2}]

Output should be equivalent to:

import pandas as pd
out = pd.DataFrame(docs).astype(pd.SparseDtype(float))

without creating dense arrays along the way. (We can check out.dtypes and out.sparse.density.)
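To make that check concrete, here is a minimal sketch that runs it on the reference construction above. The names all_sparse and density are only for illustration, and this still builds the dense frame first, so it shows the target output rather than a solution:

```python
import pandas as pd

docs = [{'hello': 1}, {'world': 1, '!': 2}]

# Reference construction from the question: dense first, then converted.
out = pd.DataFrame(docs).astype(pd.SparseDtype(float))

# Every column should report a sparse dtype.
all_sparse = all(isinstance(dtype, pd.SparseDtype) for dtype in out.dtypes)

# Fraction of cells that are explicitly stored rather than left as the fill value.
density = out.sparse.density

print(all_sparse, density)
```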

Attempt 1:

out = pd.DataFrame(dtype=pd.SparseDtype(float))
out.loc[0, 'hello'] = 1
out.loc[1, 'world'] = 1
out.loc[1, '!'] = 2

But this produces a dense data structure.

Attempt 2:

out = pd.DataFrame({"hello": pd.SparseArray([]),
                    "world": pd.SparseArray([]),
                    "!": pd.SparseArray([])})
out.loc[0, 'hello'] = 1

But this raises TypeError: SparseArray does not support item assignment via setitem.

The solution I eventually found (posted as an answer below) did not work in the earlier versions of pandas I tried.

Upvotes: 1

Views: 518

Answers (1)

joeln

Reputation: 3643

This appears to work in Pandas 0.25.1:

out = pd.DataFrame([[0, 'hello', 1], [1, 'world', 1], [1, '!', 2]],
                   columns=['docid', 'word', 'n'])
out = out.set_index(['docid', 'word'])['n'].astype(pd.SparseDtype(float))
out = out.unstack()

Or, more generally:

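import pandas as pd

def dicts_to_sparse_dataframe(docs):
    # One (docid, word, count) row per stored entry, i.e. long format.
    rows = ((i, k, v)
            for i, doc in enumerate(docs)
            for k, v in doc.items())
    out = pd.DataFrame(rows, columns=['docid', 'word', 'n'])
    # Convert the counts to a sparse dtype before pivoting to wide format.
    out = out.set_index(['docid', 'word'])['n'].astype(pd.SparseDtype(float))
    # Pivot words into columns; (docid, word) pairs that never occurred stay missing.
    out = out.unstack()
    return out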

Then:

>>> docs = [{'hello': 1}, {'world': 1, '!': 2}]
>>> df = dicts_to_sparse_dataframe(docs)
>>> df.sparse.density
0.5

I'm hoping this does not create a dense in-memory structure along the way...
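If the intermediate structures are a concern, one alternative is to build a SciPy COO matrix from the dicts and hand it to pd.DataFrame.sparse.from_spmatrix. The following is a sketch under assumptions, not a tested replacement for the above: it needs SciPy installed, the function name dicts_to_sparse_dataframe_coo and the vocab mapping are made up for illustration, and the missing entries come out as 0 rather than NaN, so the result is not identical to the unstack version.

```python
import pandas as pd
from scipy import sparse


def dicts_to_sparse_dataframe_coo(docs):
    """Build a sparse DataFrame from dicts via a SciPy COO matrix (illustrative)."""
    vocab = {}  # word -> column position, in order of first appearance
    row_idx, col_idx, values = [], [], []
    for i, doc in enumerate(docs):
        for word, n in doc.items():
            j = vocab.setdefault(word, len(vocab))
            row_idx.append(i)
            col_idx.append(j)
            values.append(float(n))
    # Only the explicitly listed (row, column, value) triples are stored.
    mat = sparse.coo_matrix((values, (row_idx, col_idx)),
                            shape=(len(docs), len(vocab)))
    # from_spmatrix wraps each column in a sparse array with fill value 0.
    return pd.DataFrame.sparse.from_spmatrix(mat, columns=list(vocab))


docs = [{'hello': 1}, {'world': 1, '!': 2}]
df_coo = dicts_to_sparse_dataframe_coo(docs)
print(df_coo.dtypes)
print(df_coo.sparse.density)
```

The only per-entry storage here is the three Python lists, whose length is the number of stored entries rather than documents x terms.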

Upvotes: 1
