Reputation: 600
I have a Dask DataFrame with a column "is_internal" of type int64. I want to update it similarly to this SQL CASE statement:
CASE WHEN ltrim(rtrim(is_internal)) = '1' then 'Internal' else 'External' END as type
I'm importing the data as:
import pandas as pd
import dask.dataframe as dd
import time
t=time.process_time()
df_train = dd.read_csv(r"C:\test.bcp", sep='\t', sample=25000000)
Normally in pandas I would do something like the following, but it uses a lot of memory, which I'm limited on.
df_train.loc[df_train['is_internal'] == 1, 'type'] = 'internal'
df_train.loc[df_train['is_internal'] == 0, 'type'] = 'external'
What is the best method that won't use a lot of space/memory with Dask?
Upvotes: 0
Views: 214
Reputation: 28673
You should not mutate Dask objects. You can achieve what you want using the .where method. Unfortunately, many find its formulation confusing; in many cases where you would like to proceed the direct pandas way, you can instead wrap your code with map_partitions:
def simple_where(df):
    # runs on each pandas partition, so the usual .loc assignment is fine here
    df.loc[df['is_internal'] == 1, 'type'] = 'internal'
    df.loc[df['is_internal'] == 0, 'type'] = 'external'
    return df

df_out = df_train.map_partitions(simple_where)
Upvotes: 1