Sql_Pete_Belfast

Reputation: 600

Dask Updating column similar to sql case

I have a Dask dataframe with a column "is_internal" of type int64. I want to update it in the same way as this SQL CASE statement:

 CASE WHEN ltrim(rtrim(is_internal)) = '1' then 'Internal' else 'External' END as type

I'm importing the data as:

import pandas as pd
import dask.dataframe as dd
import time
t = time.process_time()  # CPU timer around the load
# sample= sets how many bytes Dask reads up front to infer column dtypes
df_train = dd.read_csv(r"C:\test.bcp", sep='\t', sample=25000000)

In pandas I would normally do something like the following, but it uses a lot of memory, which I'm short on:

df_train.loc[df_train['is_internal'] == 1, 'type'] = 'internal'
df_train.loc[df_train['is_internal'] == 0, 'type'] = 'external'

What is the best method in Dask that won't use a lot of space/memory?

Upvotes: 0

Views: 214

Answers (1)

mdurant

Reputation: 28673

You should not mutate Dask objects. You can achieve what you want using the .where method (a sketch of that route follows the snippet below), although many find its formulation confusing. In many cases where you would like to proceed in a direct pandas way, you can wrap your code with map_partitions:

def simple_where(df):
    # df is a plain pandas DataFrame here: one partition at a time
    df.loc[df['is_internal'] == 1, 'type'] = 'internal'
    df.loc[df['is_internal'] == 0, 'type'] = 'external'
    return df

df_out = df_train.map_partitions(simple_where)
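
For reference, here is a minimal sketch of the .where route mentioned above. The column and label names follow the question; seeding a constant column with assign first is my own assumption, not something from the original post:

# Seed a constant column, then swap in the other label wherever the
# condition is False; Series.where keeps values where the condition
# is True and substitutes the second argument everywhere else.
df_out = df_train.assign(type='internal')
df_out['type'] = df_out['type'].where(df_out['is_internal'] == 1, 'external')

Either way, nothing is computed until you call .compute() or write the result out, and map_partitions only needs one pandas partition in memory at a time; if Dask's automatic metadata inference trips up, you can pass an explicit meta= argument to map_partitions.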

Upvotes: 1
