Reputation: 887
I have a large dataframe with a categorical variable. I would like to pull values from the subset of the dataframe corresponding to each value of the categorical variable and save them as a collection of lists (which are used to create sparse vectors in the code example I provide).
My current method iterates over each value of the categorical variable, selects a dataframe with this value, and then pulls lists from this sub-dataframe. It is quite slow, I think due to two things: looping over the dataframe and the creation of the sub-dataframe.
I would like to speed up the process and figure out a way to avoid this sort of looping through temporary dataframes (which I find myself doing frequently in my code). To give a sense of scale for my current project, I have about 7k categories on 5mil observations. I include code below to demonstrate my current workflow:
Dataframe setup:
import pandas as pd
c1=['a','b','c','d','e']*5
c2=[4,8,3,5,6]*6
c3=list(range(1,11))*3
df=pd.DataFrame(list(zip(c1,c2,c3)),columns=['catvar','weight','loc'])
Function that loops over subsets of the dataframe:
from scipy.sparse import csr_matrix

def make_sparse_vectors(df,
                        loc_colname='loc',
                        weighting_colname='weight',
                        cat_colname='catvar',
                        ):
    # create list of ids:
    id_list=list(df[cat_colname].unique())
    # length of sparse vector:
    vlength=max(df[loc_colname])+1
    # loop to create sparse vectors:
    sparse_vector_dict={}
    for i in id_list:
        df_temp=df[df[cat_colname]==i]
        temp_loc_list=df_temp[loc_colname].tolist()
        temp_weight=df_temp[weighting_colname].tolist()
        temp_row_list=[0]*len(temp_loc_list)
        sparse_vector_dict[i]=csr_matrix((temp_weight,(temp_row_list,temp_loc_list)),shape=(1,vlength))
    return sparse_vector_dict
make_sparse_vectors(df)
Returns:
{'a': <1x11 sparse matrix of type '<class 'numpy.intc'>'
with 2 stored elements in Compressed Sparse Row format>,
'b': <1x11 sparse matrix of type '<class 'numpy.intc'>'
with 2 stored elements in Compressed Sparse Row format>,
'c': <1x11 sparse matrix of type '<class 'numpy.intc'>'
with 2 stored elements in Compressed Sparse Row format>,
'd': <1x11 sparse matrix of type '<class 'numpy.intc'>'
with 2 stored elements in Compressed Sparse Row format>,
'e': <1x11 sparse matrix of type '<class 'numpy.intc'>'
with 2 stored elements in Compressed Sparse Row format>}
The code snippet that I think can most be optimized is the point where I loop over unique values and create a temporary dataframe:
for i in id_list:
    df_temp=df[df[cat_colname]==i]
Some thoughts:
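One idea I've been toying with: `groupby` can hand back each category's row positions in a single pass over the dataframe, which at least avoids the boolean mask (and temporary dataframe) per category. A sketch using the toy dataframe above; I haven't folded it into the function yet:

```python
import pandas as pd

# example dataframe from above
c1=['a','b','c','d','e']*5
c2=[4,8,3,5,6]*6
c3=list(range(1,11))*3
df=pd.DataFrame(list(zip(c1,c2,c3)),columns=['catvar','weight','loc'])

# one pass: map each category label to the integer positions of its rows
positions = df.groupby('catvar').indices
```

`positions` is a plain dict of label -> NumPy integer array, so the per-category lookups afterwards are cheap.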
Upvotes: 0
Views: 250
Reputation: 14104
I'm not sure exactly what you're trying to return, but you should use groupby.
This is how I would do it:

loc_colname='loc'
weighting_colname='weight'
cat_colname='catvar'
vlength=max(df[loc_colname])+1

def create_sparse_vectors(df_temp):
    temp_loc_list=df_temp[loc_colname].tolist()
    temp_weight=df_temp[weighting_colname].tolist()
    temp_row_list=[0]*len(temp_loc_list)
    return csr_matrix((temp_weight,(temp_row_list,temp_loc_list)),shape=(1,vlength))

new_df = df.groupby(cat_colname).apply(create_sparse_vectors)
To get a dict, use `to_dict`:
df_dict = new_df.to_dict()
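Putting it together, a self-contained sketch of the groupby version, using the toy dataframe from the question (the helper name `make_sparse_vector` is just for illustration):

```python
import pandas as pd
from scipy.sparse import csr_matrix

# example dataframe from the question
c1=['a','b','c','d','e']*5
c2=[4,8,3,5,6]*6
c3=list(range(1,11))*3
df=pd.DataFrame(list(zip(c1,c2,c3)),columns=['catvar','weight','loc'])

vlength = df['loc'].max() + 1  # length of each sparse vector

def make_sparse_vector(group):
    # build a 1 x vlength CSR row from one category's rows;
    # duplicate (row, col) entries are summed by csr_matrix
    rows = [0] * len(group)
    return csr_matrix((group['weight'], (rows, group['loc'])), shape=(1, vlength))

# one pass over the dataframe instead of one boolean mask per category
df_dict = df.groupby('catvar').apply(make_sparse_vector).to_dict()
```

The result is the same dict-of-CSR-rows as in the question, but built from a single grouping pass.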
You can also speed this process up a lot with swifter or dask. However, if the overhead is too much, this could be slower.

fast_df = df.swifter.groupby(cat_colname).apply(create_sparse_vectors)
Upvotes: 1