amquack

Reputation: 887

How to avoid looping through categorical variables in pandas to view/operate on dataframe slices/subsets

I have a large dataframe with a categorical variable. I would like to pull values from the subset of the dataframe corresponding to each value of the categorical variable and save them as a collection of lists (which are used to create sparse vectors in the code example I provide).

My current method iterates over each value of the categorical variable, selects a sub-dataframe with that value, and then pulls lists from the sub-dataframe. It is quite slow, I think for two reasons: the Python-level loop over categories and the repeated creation of sub-dataframes.

I would like to speed up the process and figure out a way to avoid this sort of looping through temporary dataframes (which I find myself doing frequently in my code). To give a sense of scale for my current project, I have about 7k categories on 5mil observations. I include code below to demonstrate my current workflow:

Dataframe setup:

import pandas as pd

c1=['a','b','c','d','e']*5
c2=[4,8,3,5,6]*6
c3=list(range(1,11))*3

df=pd.DataFrame(list(zip(c1,c2,c3)),columns=['catvar','weight','loc'])

Function that loops over subsets of the dataframe:

from scipy.sparse import csr_matrix

def make_sparse_vectors(df,
                        loc_colname='loc',
                        weighting_colname='weight',
                        cat_colname='catvar',
                       ):
    # create list of ids:
    id_list=list(df[cat_colname].unique())

    # length of sparse vector:
    vlength=max(df[loc_colname])+1

    # loop to create sparse vectors:
    sparse_vector_dict={}
    for i in id_list:
        df_temp=df[df[cat_colname]==i]

        temp_loc_list=df_temp[loc_colname].tolist()
        temp_weight=df_temp[weighting_colname].tolist()
        temp_row_list=[0]*len(temp_loc_list)

        sparse_vector_dict[i]=csr_matrix((temp_weight,(temp_row_list,temp_loc_list)),shape=(1,vlength))
    
    return sparse_vector_dict

make_sparse_vectors(df)

Returns:

{'a': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'b': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'c': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'd': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'e': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>}

The part that I think has the most room for optimization is where I loop over the unique values and create a temporary dataframe for each:

for i in id_list:
    df_temp=df[df[cat_colname]==i]

Some thoughts:
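One idea: since the csr_matrix constructor sums duplicate (row, col) coordinates, it might be possible to build every vector in a single call, using pd.factorize to turn the categories into row indices. A minimal, untested sketch of what I have in mind:

import pandas as pd
from scipy.sparse import csr_matrix

def make_sparse_vectors_in_one_call(df,
                                    loc_colname='loc',
                                    weighting_colname='weight',
                                    cat_colname='catvar',
                                   ):
    # integer code per row (the row index of its category) plus the labels:
    codes, cats = pd.factorize(df[cat_colname])
    vlength = max(df[loc_colname]) + 1

    # one constructor call; duplicate (row, col) pairs are summed:
    mat = csr_matrix((df[weighting_colname], (codes, df[loc_colname])),
                     shape=(len(cats), vlength))

    # split the rows back out into a {category: 1 x vlength vector} dict:
    return {cat: mat.getrow(i) for i, cat in enumerate(cats)}

I don't know whether this would beat a groupby-style approach at my scale, so other suggestions are welcome.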

Upvotes: 0

Views: 250

Answers (1)

Kenan

Reputation: 14104

I'm not sure exactly what you're trying to return, but you should use groupby. This is how I would do it:

from scipy.sparse import csr_matrix

loc_colname='loc'
weighting_colname='weight'
cat_colname='catvar'
vlength = max(df[loc_colname]) + 1

def create_sparse_vectors(df_temp):
    temp_loc_list=df_temp[loc_colname].tolist()
    temp_weight=df_temp[weighting_colname].tolist()
    temp_row_list=[0]*len(temp_loc_list)

    return csr_matrix((temp_weight,(temp_row_list,temp_loc_list)),shape=(1,vlength))

new_df = df.groupby(cat_colname).apply(create_sparse_vectors)

This gives you a Series of sparse vectors indexed by category. To get a dict:

df_dict = new_df.to_dict()

You can also speed up this process a lot with swifter or dask. However, if the overhead is too much, this could end up slower.

fast_df = df.groupby(cat_colname).swifter.apply(create_sparse_vectors)
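For instance, a rough dask equivalent of the same groupby-apply might look like this (assuming dask is installed; the meta argument tells dask that each group returns a generic object):

import dask.dataframe as dd

# partition the pandas dataframe so groups can be processed in parallel:
ddf = dd.from_pandas(df, npartitions=4)
fast_df = ddf.groupby(cat_colname).apply(create_sparse_vectors,
                                         meta=('sparse_vec', 'object')).compute()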

Upvotes: 1
