Reputation: 10383
I wrote a function to downsample data using pandas, but some of the datasets I have don't fit in memory, so I want to try it with dask. This is the working code I have now:
def sample_df(df, target_column="target", positive_percentage=35, index_col="index"):
    """
    Takes as input a data frame with imbalanced records, e.g. x% of positive cases,
    and returns a dataframe with the specified percentage, e.g. 10%.
    This is accomplished by downsampling the majority class.
    """
    positive_cases = df[df[target_column] == 1][index_col]
    number_of_samples = int(((100 / positive_percentage) - 1) * len(positive_cases))
    negative_cases = list(set(df[index_col]) - set(positive_cases))
    try:
        negative_sample = random.sample(negative_cases, number_of_samples)
    except ValueError:
        print("The requested percentage is not valid for this dataset")
        return pd.DataFrame()
    final_sample = list(negative_sample) + list(positive_cases)
    # df = df.iloc[final_sample]
    df = df[df[index_col].isin(final_sample)]
    # df = df.reset_index(drop=True)
    print("New percentage is:", df[target_column].sum() / len(df[target_column]) * 100)
    return df
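For example, plugging numbers into the formula above: with positive_percentage = 35 and 1,000 positive rows, the function keeps int((100 / 35 - 1) * 1000) = 1857 negatives, so the positives end up as 1000 / 2857 ≈ 35% of the returned frame.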
The function can be used as:
import pandas as pd
import random
from sklearn.datasets import make_classification

x, y = make_classification(100000, 500)
df = pd.DataFrame(x)
df["target"] = y
df["id"] = 1
df["id"] = df["id"].cumsum()
output_df = sample_df(df, target_column="target", positive_percentage=65, index_col="id")
This works fine with pandas for small datasets, but when I try it on datasets that do not fit in memory, the computer crashes whether I use pandas or dask.
How can I apply this function to each data chunk that dask reads, and then merge them all?
Upvotes: 1
Views: 137
Reputation: 5359
Provided the subsampled result is small enough to fit in memory, this can be done in pure pandas without dask. Read the file in chunks, apply your filters to each chunk, and collect each subsampled chunk into one result dataframe; you operate on a chunk exactly as you would on a df. Since you said you cannot load your data into memory, I start from a file: the df argument of your function becomes infile, and a chunk_size argument (default 10000) controls how many rows are processed at a time:
def sample_df(infile, target_column="target", positive_percentage=35, index_col="index", chunk_size=10000):
    """
    Takes as input a CSV file with imbalanced records, e.g. x% of positive cases,
    and returns a dataframe with the specified percentage, e.g. 10%.
    This is accomplished by downsampling the majority class, one chunk at a time.
    """
    sampled_chunks = []
    for chunk in pd.read_csv(infile, chunksize=chunk_size):
        positive_cases = chunk[chunk[target_column] == 1][index_col]
        number_of_samples = int(((100 / positive_percentage) - 1) * len(positive_cases))
        negative_cases = list(set(chunk[index_col]) - set(positive_cases))
        try:
            negative_sample = random.sample(negative_cases, number_of_samples)
        except ValueError:
            print("The requested percentage is not valid for this dataset")
            return pd.DataFrame()
        final_sample = list(negative_sample) + list(positive_cases)
        # subdf = chunk.iloc[final_sample]
        subdf = chunk[chunk[index_col].isin(final_sample)]
        # subdf = chunk.reset_index(drop=True)
        # collect each subsampled chunk, then concatenate them into the result
        sampled_chunks.append(subdf)
    df = pd.concat(sampled_chunks, ignore_index=True)
    print("New percentage is:", df[target_column].sum() / len(df[target_column]) * 100)
    return df
Doing this will subsample each chunk of data rather than the whole df.
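As a rough usage sketch (the file name below is just a placeholder for your on-disk CSV, and the column names match your example):

output_df = sample_df("big_dataset.csv", target_column="target",
                      positive_percentage=65, index_col="id",
                      chunk_size=10000)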
Upvotes: 1