Reputation: 693
This question follows on from the solution to apply lambda function to a dask dataframe. I am looking for a solution that does not require a pandas dataframe as an intermediate step, because I have a larger-than-memory dataframe: loading it all into memory, as pandas does, will not work (pandas is really good when the data fits in memory).
The solution to the linked question is below.
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})  # How to read this sort of format directly into a dask dataframe?
ddf = dd.from_pandas(df, npartitions=2)  # dask conversion
list1 = ['A', 'B', 'C']  # list of header names

for c in list1:
    vc = ddf[c].value_counts().compute()
    vc /= vc.sum()
    print(vc)  # a table with the proportion of each unique value
    for i in range(vc.count()):
        if vc.iloc[i] < 0.5:  # checks whether the value has a proportion of less than 0.5
            ddf[c] = ddf[c].where(ddf[c] != vc.index[i], 'others')  # changes such values to 'others' (iterates through all columns in list1)

print(ddf.compute())  # shows how the changes have been applied column by column
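As for the comment in the snippet about reading this sort of data directly into a dask dataframe: dask can read common file formats lazily, without a pandas intermediate. A minimal sketch, assuming the data lives in a (hypothetical) CSV or parquet file:

import dask.dataframe as dd

ddf = dd.read_csv('data.csv')          # lazy, partitioned read; nothing is loaded into memory yet
# or, for columnar storage:
ddf = dd.read_parquet('data.parquet')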
However, the second for loop takes a very long time to compute on the actual (larger-than-memory) dataframe. Is there a more efficient way of getting the same output using dask?
The objective of the code is to change a column value to 'others' for labels that appear less than 50% of the time in that column. For example, if the value 'ant' appears less than 50% of the time in a column, it is renamed to 'others'.
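For the sample dataframe above, only 'ant' in column A and 'cat' in column B reach the 50% mark, so the intended output of ddf.compute() is:

        A       B       C
0     ant     cat  others
1     ant  others  others
2  others     cat  others
3  others     cat  others
4     ant  others  others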
Would anyone be able to help me in this regard?
Thanks
Michael
Upvotes: 0
Views: 1351
Reputation: 13437
Here is a way to skip your nested loop:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

l = len(ddf)
for col in ddf.columns:
    vc = ddf[col].value_counts() / l                         # proportion of each unique value (still lazy)
    vc = vc[vc > .5].index.compute()                         # only the labels that clear the threshold are materialized
    ddf[col] = ddf[col].where(ddf[col].isin(vc), "other")    # one vectorized mask instead of one `where` per rare label
ddf = ddf.compute()
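This replaces the per-label where calls with a single isin mask per column, and only the small index of surviving labels is brought back to memory. Note two small differences from the question's code: a label appearing in exactly half the rows is replaced here (vc > .5) but kept by the question's vc[i] < 0.5 check, and the replacement string is "other" rather than 'others'. As a further tweak (my sketch, not part of the answer above): the row count and all the value_counts calls can be batched into a single dask.compute, so the data is scanned once rather than once per column:

import dask
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

# One scan of the data: the row count and every column's value_counts computed together
nrows, *counts = dask.compute(ddf.shape[0], *[ddf[col].value_counts() for col in ddf.columns])

for col, vc in zip(ddf.columns, counts):
    keep = vc[vc / nrows > .5].index                  # labels appearing in more than half the rows
    ddf[col] = ddf[col].where(ddf[col].isin(keep), "other")
ddf = ddf.compute()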
If you have a really big dataframe and it is stored in parquet format, you can try to read it column by column and save each result to a different file. At the end you can just concatenate them horizontally.
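A rough sketch of that column-by-column approach (the file paths here are placeholders, and dd.concat(..., axis=1) assumes the pieces end up with known, matching divisions):

import dask.dataframe as dd

cols = ['A', 'B', 'C']                                    # columns to process
for col in cols:
    part = dd.read_parquet('big.parquet', columns=[col])  # lazily load a single column
    n = len(part)
    vc = part[col].value_counts() / n
    keep = vc[vc > .5].index.compute()
    part[col] = part[col].where(part[col].isin(keep), "other")
    part.to_parquet(f'out_{col}.parquet')                 # persist the transformed column

# Finally, stitch the per-column results back together horizontally
result = dd.concat([dd.read_parquet(f'out_{c}.parquet') for c in cols], axis=1)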
Upvotes: 1