Reputation: 8421
I have a dictionary of the format:
{
    "sample1": set(["feature1", "feature2", "feature3"]),
    "sample2": set(["feature1", "feature4", "feature5"]),
}
where I have 20M samples and 150K unique features.
I want to convert this into a csv of the format:
sample,feature1,feature2,feature3,feature4,feature5
sample1,1,1,1,0,0
sample2,1,0,0,1,1
What I have done so far:
ALL_FEATURES = list(set(features))
with open("features.csv", "w") as f:
    f.write("fvecmd5," + ",".join([str(x) for x in ALL_FEATURES]) + "\n")
    fvecs_lol = list(fvecs.items())
    fvecs_keys, fvecs_values = zip(*fvecs_lol)
    del fvecs_lol
    tmp = [["1" if feature in featurelist else "0" for feature in ALL_FEATURES]
           for featurelist in fvecs_values]
    for i, entry in enumerate(tmp):
        f.write(fvecs_keys[i] + "," + ",".join(entry) + "\n")
But this is running very slow. Are there faster ways? Maybe leveraging NumPy/Cython?
Upvotes: 1
Views: 726
Reputation: 210922
You can use sklearn.feature_extraction.text.CountVectorizer, which produces a sparse matrix, and then build a SparseDataFrame from it:
In [49]: s = pd.SparseSeries(d).astype(str).str.replace(r"[{,'}]",'')
In [50]: s
Out[50]:
sample1 feature1 feature2 feature3
sample2 feature1 feature5 feature4
dtype: object
In [51]: from sklearn.feature_extraction.text import CountVectorizer
In [52]: cv = CountVectorizer()
In [53]: r = pd.SparseDataFrame(cv.fit_transform(s),
                                s.index,
                                cv.get_feature_names(),
                                default_fill_value=0)
In [54]: r
Out[54]:
feature1 feature2 feature3 feature4 feature5
sample1 1 1 1 0 0
sample2 1 0 0 1 1
Upvotes: 4
Reputation: 323326
Is this what you need?
pd.Series(d).apply(','.join).str.get_dummies(sep=',')
Out[50]:
feature1 feature2 feature3 feature4 feature5
sample1 1 1 1 0 0
sample2 1 0 0 1 1
You can add to_csv at the end.
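For reference, a minimal end-to-end sketch of that (assuming d is the dict from the question; the file name and index_label are just illustrative):

import pandas as pd

# d is the {sample: set-of-features} dict from the question
dummies = pd.Series(d).apply(','.join).str.get_dummies(sep=',')
dummies.to_csv("features.csv", index_label="sample")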
How about this:
s=pd.Series(d).to_frame('v')
s.v=list(map(','.join,s.v.values))
s.v.str.get_dummies(sep=',')
Out[86]:
feature1 feature2 feature3 feature4 feature5
sample1 1 1 1 0 0
sample2 1 0 0 1 1
Upvotes: 3
Reputation: 1012
So, you want to convert your data from a sparse representation to a dense representation.
How? You could load the data into a sparse matrix (check out scipy.sparse.coo_matrix, which sort of fits your case), convert it to a dense NumPy array (with .toarray()), and save that back as a CSV (maybe going through a list of lists first).
(OR you could use some fancy pandas coding as someone else suggested.)
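For what it's worth, a rough sketch of that route (assuming d is the dict from the question; the helper names are made up here):

import numpy as np
from scipy.sparse import coo_matrix

# d is the {sample: set-of-features} dict from the question
samples = list(d)
features = sorted({f for feats in d.values() for f in feats})
col_of = {f: j for j, f in enumerate(features)}

rows, cols = [], []
for i, sample in enumerate(samples):
    for f in d[sample]:
        rows.append(i)
        cols.append(col_of[f])

m = coo_matrix((np.ones(len(rows), dtype=np.int8), (rows, cols)),
               shape=(len(samples), len(features)))
dense = m.toarray()  # only feasible for a small subset of 20M x 150K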
HOWEVER, the real question is: why do you want to store such a large dataset in a dense format? It would be extremely inefficient in memory and disk space, and the conversion SHOULD take a long time for a dataset this size. Specifically, with 20M samples and 150K features, a dense representation would not fit in your memory and likely not even on your disk.
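A quick back-of-envelope check (two bytes per cell just assumes a single digit plus a comma in the CSV):

n_samples, n_features = 20_000_000, 150_000
cells = n_samples * n_features  # 3e12 cells in the dense matrix
csv_bytes = cells * 2           # "0," or "1," per cell -> about 6 TB
print(f"{cells:.1e} cells, roughly {csv_bytes / 1e12:.0f} TB as a dense CSV")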
Upvotes: 0