Reputation: 693
This question follows on from the solution to apply lambda function to a dask dataframe. I am looking for a solution that does not require a pandas dataframe as an intermediate step, because I have a larger-than-memory dataframe: loading it all into memory, as pandas does, will not work (pandas is really good when the data fits in memory).
The solution to the linked question is below.
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})  # How to read this sort of format directly into a dask dataframe?
ddf = dd.from_pandas(df, npartitions=2)  # dask conversion
list1 = ['A', 'B', 'C']  # list of header names

for c in list1:
    vc = ddf[c].value_counts().compute()
    vc /= vc.sum()
    print(vc)  # a table with the proportion of each unique value
    for i in range(vc.count()):
        if vc.iloc[i] < 0.5:  # checks whether the value has a proportion of less than 0.5
            ddf[c] = ddf[c].where(ddf[c] != vc.index[i], 'others')  # changes such values to 'others' (iterates through all columns in list1)

print(ddf.compute())  # shows how the changes have been applied column by column
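As for the comment in the snippet about reading this sort of data directly into a dask dataframe: dask can read common file formats lazily, without a pandas intermediate. A minimal sketch, assuming the data lives in a (hypothetical) CSV or parquet file:

import dask.dataframe as dd

ddf = dd.read_csv('data.csv')          # lazy, partitioned read; nothing is loaded into memory yet
# or, for columnar storage:
ddf = dd.read_parquet('data.parquet')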
However, the second for loop takes a very long time to compute on the actual (larger-than-memory) dataframe. Is there a more efficient way of getting the same output using dask?
The objective of the code is to change a column value to 'others' for labels that appear less than 50% of the time in that column. For example, if the value 'ant' appears less than 50% of the time in a column, it is renamed to 'others'.
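For the sample dataframe above, only 'ant' in column A and 'cat' in column B reach the 50% mark, so the intended output of ddf.compute() is:

        A       B       C
0     ant     cat  others
1     ant  others  others
2  others     cat  others
3  others     cat  others
4     ant  others  others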
Would anyone be able to help me in this regard?
Thanks
Michael
Upvotes: 0
Views: 1351
Reputation: 13437
Here is a way to skip your nested loop:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

l = len(ddf)
for col in ddf.columns:
    vc = ddf[col].value_counts() / l                         # proportion of each unique value (still lazy)
    vc = vc[vc > .5].index.compute()                         # only the labels that clear the threshold are materialized
    ddf[col] = ddf[col].where(ddf[col].isin(vc), "other")    # one vectorized mask instead of one `where` per rare label
ddf = ddf.compute()
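This replaces the per-label where calls with a single isin mask per column, and only the small index of surviving labels is brought back to memory. Note two small differences from the question's code: a label appearing in exactly half the rows is replaced here (vc > .5) but kept by the question's vc[i] < 0.5 check, and the replacement string is "other" rather than 'others'. As a further tweak (my sketch, not part of the answer above): the row count and all the value_counts calls can be batched into a single dask.compute, so the data is scanned once rather than once per column:

import dask
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

# One scan of the data: the row count and every column's value_counts computed together
nrows, *counts = dask.compute(ddf.shape[0], *[ddf[col].value_counts() for col in ddf.columns])

for col, vc in zip(ddf.columns, counts):
    keep = vc[vc / nrows > .5].index                  # labels appearing in more than half the rows
    ddf[col] = ddf[col].where(ddf[col].isin(keep), "other")
ddf = ddf.compute()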
If you have a really big dataframe and it is stored in parquet format, you can try to read it column by column and save each result to a different file. At the end you can just concatenate them horizontally.
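A rough sketch of that column-by-column approach (the file paths here are placeholders, and dd.concat(..., axis=1) assumes the pieces end up with known, matching divisions):

import dask.dataframe as dd

cols = ['A', 'B', 'C']                                    # columns to process
for col in cols:
    part = dd.read_parquet('big.parquet', columns=[col])  # lazily load a single column
    n = len(part)
    vc = part[col].value_counts() / n
    keep = vc[vc > .5].index.compute()
    part[col] = part[col].where(part[col].isin(keep), "other")
    part.to_parquet(f'out_{col}.parquet')                 # persist the transformed column

# Finally, stitch the per-column results back together horizontally
result = dd.concat([dd.read_parquet(f'out_{c}.parquet') for c in cols], axis=1)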
Upvotes: 1