amquack

Reputation: 887

How to avoid looping through categorical variables in pandas to view/operate on dataframe slices/subsets

I have a large dataframe with a categorical variable. I would like to pull values from the subset of the dataframe corresponding to each value of the categorical variable and save them as a collection of lists (which are used to create sparse vectors in the code example I provide).

My current method iterates over each value of the categorical variable, selects a sub-dataframe with that value, and then pulls lists from the sub-dataframe. It is quite slow, I think for two reasons: the Python-level loop over categories and the repeated creation of sub-dataframes.

I would like to speed up the process and figure out a way to avoid this sort of looping through temporary dataframes (which I find myself doing frequently in my code). To give a sense of scale for my current project, I have about 7k categories on 5mil observations. I include code below to demonstrate my current workflow:

Dataframe setup:

import pandas as pd

c1=['a','b','c','d','e']*5
c2=[4,8,3,5,6]*6
c3=list(range(1,11))*3

df=pd.DataFrame(list(zip(c1,c2,c3)),columns=['catvar','weight','loc'])

Function that loops over subsets of the dataframe:

from scipy.sparse import csr_matrix

def make_sparse_vectors(df,
                        loc_colname='loc',
                        weighting_colname='weight',
                        cat_colname='catvar',
                       ):
    # create list of ids:
    id_list=list(df[cat_colname].unique())

    # length of sparse vector:
    vlength=max(df[loc_colname])+1

    # loop to create sparse vectors:
    sparse_vector_dict={}
    for i in id_list:
        df_temp=df[df[cat_colname]==i]

        temp_loc_list=df_temp[loc_colname].tolist()
        temp_weight=df_temp[weighting_colname].tolist()
        temp_row_list=[0]*len(temp_loc_list)

        sparse_vector_dict[i]=csr_matrix((temp_weight,(temp_row_list,temp_loc_list)),shape=(1,vlength))
    
    return sparse_vector_dict

make_sparse_vectors(df)

Returns:

{'a': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'b': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'c': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'd': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'e': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>}

The part that I think has the most room for optimization is where I loop over the unique values and create a temporary dataframe for each:

for i in id_list:
    df_temp=df[df[cat_colname]==i]

Some thoughts:
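One idea: since the csr_matrix constructor sums duplicate (row, col) coordinates, it might be possible to build every vector in a single call, using pd.factorize to turn the categories into row indices. A minimal, untested sketch of what I have in mind:

import pandas as pd
from scipy.sparse import csr_matrix

def make_sparse_vectors_in_one_call(df,
                                    loc_colname='loc',
                                    weighting_colname='weight',
                                    cat_colname='catvar',
                                   ):
    # integer code per row (the row index of its category) plus the labels:
    codes, cats = pd.factorize(df[cat_colname])
    vlength = max(df[loc_colname]) + 1

    # one constructor call; duplicate (row, col) pairs are summed:
    mat = csr_matrix((df[weighting_colname], (codes, df[loc_colname])),
                     shape=(len(cats), vlength))

    # split the rows back out into a {category: 1 x vlength vector} dict:
    return {cat: mat.getrow(i) for i, cat in enumerate(cats)}

I don't know whether this would beat a groupby-style approach at my scale, so other suggestions are welcome.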

Upvotes: 0

Views: 250

Answers (1)

Kenan

Reputation: 14104

I'm not sure exactly what you're trying to return, but you should use groupby. This is how I would do it:

from scipy.sparse import csr_matrix

loc_colname='loc'
weighting_colname='weight'
cat_colname='catvar'
vlength = max(df[loc_colname]) + 1

def create_sparse_vectors(df_temp):
    temp_loc_list=df_temp[loc_colname].tolist()
    temp_weight=df_temp[weighting_colname].tolist()
    temp_row_list=[0]*len(temp_loc_list)

    return csr_matrix((temp_weight,(temp_row_list,temp_loc_list)),shape=(1,vlength))

new_df = df.groupby(cat_colname).apply(create_sparse_vectors)

This gives you a Series of sparse vectors indexed by category. To get a dict:

df_dict = new_df.to_dict()

You can also speed up this process a lot with swifter or dask. However, if the overhead is too much, this could end up slower.

fast_df = df.groupby(cat_colname).swifter.apply(create_sparse_vectors)
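For instance, a rough dask equivalent of the same groupby-apply might look like this (assuming dask is installed; the meta argument tells dask that each group returns a generic object):

import dask.dataframe as dd

# partition the pandas dataframe so groups can be processed in parallel:
ddf = dd.from_pandas(df, npartitions=4)
fast_df = ddf.groupby(cat_colname).apply(create_sparse_vectors,
                                         meta=('sparse_vec', 'object')).compute()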

Upvotes: 1
