Lelouch Lamperouge

Reputation: 8421

Creating a very large sparse matrix csv from a list of condensed data

I have a dictionary of the format:

{
  "sample1": set(["feature1", "feature2", "feature3"]),
  "sample2": set(["feature1", "feature4", "feature5"]),
}

where I have 20M samples and 150K unique features.

I want to convert this into a csv of the format:

sample,feature1,feature2,feature3,feature4,feature5
sample1,1,1,1,0,0
sample2,1,0,0,1,1

What I have done so far:

  1. ALL_FEATURES = list(set(features))
  2. with open("features.csv", "w") as f:
        # header row: the id column followed by every unique feature name
        f.write("fvecmd5," + ",".join([str(x) for x in ALL_FEATURES]) + "\n")
        fvecs_lol = list(fvecs.items())
        fvecs_keys, fvecs_values = zip(*fvecs_lol)
        del fvecs_lol
        # one "1"/"0" flag per feature for every sample's feature set
        tmp = [["1" if feature in featurelist else "0" for feature in ALL_FEATURES]
               for featurelist in fvecs_values]
        for i, entry in enumerate(tmp):
            f.write(fvecs_keys[i] + "," + ",".join(entry) + "\n")

But this runs very slowly. Are there faster ways? Maybe leveraging NumPy/Cython?

Upvotes: 1

Views: 726

Answers (3)

MaxU - stand with Ukraine

Reputation: 210922

You can use sklearn.feature_extraction.text.CountVectorizer, which produces a sparse matrix, and then build a SparseDataFrame from it:

In [49]: s = pd.SparseSeries(d).astype(str).str.replace(r"[{,'}]",'')

In [50]: s
Out[50]:
sample1    feature1 feature2 feature3
sample2    feature1 feature5 feature4
dtype: object

In [51]: from sklearn.feature_extraction.text import CountVectorizer

In [52]: cv = CountVectorizer()

In [53]: r = pd.SparseDataFrame(cv.fit_transform(s),
                                s.index, 
                                cv.get_feature_names(), 
                                default_fill_value=0)

In [54]: r
Out[54]:
         feature1  feature2  feature3  feature4  feature5
sample1         1         1         1         0         0
sample2         1         0         0         1         1
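
SparseSeries and SparseDataFrame have since been removed from pandas, so on current pandas / scikit-learn the same idea would look roughly like the sketch below. This is only a sketch, assuming the input dict is named d as in the question; a callable analyzer lets CountVectorizer consume the sets directly, so the string round-trip above is not needed:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

d = {
    "sample1": {"feature1", "feature2", "feature3"},
    "sample2": {"feature1", "feature4", "feature5"},
}

samples = list(d)
# the callable analyzer receives each raw "document" (here: one set of features)
cv = CountVectorizer(analyzer=lambda features: features)
X = cv.fit_transform([d[s] for s in samples])      # scipy sparse matrix

# keep it sparse on the pandas side via the sparse accessor
r = pd.DataFrame.sparse.from_spmatrix(
        X, index=samples,
        columns=cv.get_feature_names_out())        # get_feature_names() on older sklearn

# "sample" as index label matches the desired header from the question
r.to_csv("features.csv", index_label="sample")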

Upvotes: 4

BENY

Reputation: 323326

Is this what you need?

pd.Series(d).apply(','.join).str.get_dummies(sep=',')
Out[50]: 
         feature1  feature2  feature3  feature4  feature5
sample1         1         1         1         0         0
sample2         1         0         0         1         1

You can add to_csv at the end to write the result to disk; a sketch follows.
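
For example, a minimal sketch of the full chain, assuming the dict is named d and that you want the "sample" header from the question as the index label:

import pandas as pd

d = {
    "sample1": {"feature1", "feature2", "feature3"},
    "sample2": {"feature1", "feature4", "feature5"},
}

# join each set into one comma-separated string, one-hot encode it, then write it out
(pd.Series(d)
   .apply(','.join)
   .str.get_dummies(sep=',')
   .to_csv("features.csv", index_label="sample"))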

Or how about this, going through a DataFrame first:

s = pd.Series(d).to_frame('v')
s.v = list(map(','.join, s.v.values))
s.v.str.get_dummies(sep=',')
Out[86]: 
         feature1  feature2  feature3  feature4  feature5
sample1         1         1         1         0         0
sample2         1         0         0         1         1

Upvotes: 3

Tomer Levinboim

Reputation: 1012

So, you want to convert your data from a sparse representation to a dense CSV representation.

How? You could load the data into a sparse matrix (check out scipy.sparse.coo_matrix, which roughly fits your case), convert it to a dense NumPy array (with its toarray() method), and save that back out as a CSV (maybe going through a list of lists first), as sketched below.
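
A minimal sketch of that route, assuming a toy dict named d; for 20M x 150K the dense array would be far too large, so this only shows the mechanics:

import numpy as np
from scipy.sparse import coo_matrix

d = {
    "sample1": {"feature1", "feature2", "feature3"},
    "sample2": {"feature1", "feature4", "feature5"},
}

samples = sorted(d)
features = sorted(set().union(*d.values()))
col = {f: j for j, f in enumerate(features)}

# one (row, column) pair per present feature -> COO matrix of ones
rows, cols = zip(*((i, col[f]) for i, s in enumerate(samples) for f in d[s]))
m = coo_matrix((np.ones(len(rows), dtype=np.int8), (rows, cols)),
               shape=(len(samples), len(features)))

dense = m.toarray()                      # only feasible for small data
with open("features.csv", "w") as f:
    f.write("sample," + ",".join(features) + "\n")
    for name, row in zip(samples, dense):
        f.write(name + "," + ",".join(map(str, row)) + "\n")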

(OR you could use some fancy pandas coding as someone else suggested.)

HOWEVER, the real question is: why do you want to store such a large dataset in a dense format? It would be extremely inefficient in memory and disk space, and the conversion is bound to take a long time for a dataset this size. Specifically, 20M samples x 150K features is about 3x10^12 cells, so even at a single byte per cell a dense representation is roughly 3 TB: it would not fit in your memory and likely not even on your disk.

Upvotes: 0
