Reputation: 10383
I wrote a function to downsample data using pandas, but some of the datasets I have don't fit in memory, so I want to try it with dask. This is the working code I have now:
def sample_df(df, target_column="target", positive_percentage=35, index_col="index"):
    """
    Takes as input a data frame with imbalanced records, e.g. x% of positive cases,
    and returns a dataframe with the specified percentage, e.g. 10%.
    This is accomplished by downsampling the majority class.
    """
    positive_cases = df[df[target_column] == 1][index_col]
    number_of_samples = int(((100 / positive_percentage) - 1) * len(positive_cases))
    negative_cases = list(set(df[index_col]) - set(positive_cases))
    try:
        negative_sample = random.sample(negative_cases, number_of_samples)
    except ValueError:
        print("The requested percentage is not valid for this dataset")
        return pd.DataFrame()
    final_sample = list(negative_sample) + list(positive_cases)
    # df = df.iloc[final_sample]
    df = df[df[index_col].isin(final_sample)]
    # df = df.reset_index(drop=True)
    print("New percentage is:", df[target_column].sum() / len(df[target_column]) * 100)
    return df
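For example, plugging numbers into the formula above: with positive_percentage = 35 and 1,000 positive rows, the function keeps int((100 / 35 - 1) * 1000) = 1857 negatives, so the positives end up as 1000 / 2857 ≈ 35% of the returned frame.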
The function can be used as:
import pandas as pd
import random
from sklearn.datasets import make_classification

x, y = make_classification(100000, 500)
df = pd.DataFrame(x)
df["target"] = y
df["id"] = 1
df["id"] = df["id"].cumsum()
output_df = sample_df(df, target_column="target", positive_percentage=65, index_col="id")
This works fine with pandas for small datasets, but when I try it on datasets that do not fit in memory, the computer crashes whether I use pandas or dask.
How can I apply this function to each data chunk that dask reads, and then merge them all?
Upvotes: 1
Views: 137
Reputation: 5359
Provided the subsampled result is small enough to fit in memory, this can be done in pure pandas without dask. Read the file in chunks, apply your filters to each chunk, and collect each subsampled chunk into one result dataframe; you operate on a chunk exactly as you would on a df. Since you said you cannot load your data into memory, I start from a file: the df argument of your function becomes infile, and a chunk_size argument (default 10000) controls how many rows are processed at a time:
def sample_df(infile, target_column="target", positive_percentage=35, index_col="index", chunk_size=10000):
    """
    Takes as input a CSV file with imbalanced records, e.g. x% of positive cases,
    and returns a dataframe with the specified percentage, e.g. 10%.
    This is accomplished by downsampling the majority class, one chunk at a time.
    """
    sampled_chunks = []
    for chunk in pd.read_csv(infile, chunksize=chunk_size):
        positive_cases = chunk[chunk[target_column] == 1][index_col]
        number_of_samples = int(((100 / positive_percentage) - 1) * len(positive_cases))
        negative_cases = list(set(chunk[index_col]) - set(positive_cases))
        try:
            negative_sample = random.sample(negative_cases, number_of_samples)
        except ValueError:
            print("The requested percentage is not valid for this dataset")
            return pd.DataFrame()
        final_sample = list(negative_sample) + list(positive_cases)
        # subdf = chunk.iloc[final_sample]
        subdf = chunk[chunk[index_col].isin(final_sample)]
        # subdf = chunk.reset_index(drop=True)
        # collect each subsampled chunk, then concatenate them into the result
        sampled_chunks.append(subdf)
    df = pd.concat(sampled_chunks, ignore_index=True)
    print("New percentage is:", df[target_column].sum() / len(df[target_column]) * 100)
    return df
Doing this will subsample each chunk of data rather than the whole df.
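As a rough usage sketch (the file name below is just a placeholder for your on-disk CSV, and the column names match your example):

output_df = sample_df("big_dataset.csv", target_column="target",
                      positive_percentage=65, index_col="id",
                      chunk_size=10000)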
Upvotes: 1