Jones

Reputation: 343

Dask categorize() won't work after using .loc

I'm having a serious issue using dask (dask version: 1.00, pandas version: 0.23.3). I am trying to load a dask dataframe from a CSV file, filter the results into two separate dataframes, and perform operations on both.

However, after I split the dataframes and try to set the category columns to 'known', they remain 'unknown'. Thus I cannot continue with my operations (which require category columns to be 'known').

NOTE: I have created a minimal example, as suggested, using pandas instead of read_csv().

import pandas as pd
import dask.dataframe as dd

# Specify dtypes
b_dtypes = {
    'symbol': 'category',
    'price': 'float64',
}

i_dtypes = {
    'symbol': 'category',
    'price': 'object'
}

# Specify a function to quickly set dtypes
def to_dtypes(df, dtypes):
    for column, dtype in dtypes.items():
        if column in df.columns:
            df[column] = df.loc[:, column].astype(dtype)
    return df

# Set up our test data
data = [
    ['B', 'IBN', '9.9800'],
    ['B', 'PAY', '21.5000'],
    ['I', 'PAY', 'seventeen'],
    ['I', 'SPY', 'ten']
]

# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')

# Convert into dask
df = dd.from_pandas(pdf, npartitions=3)

#
## At this point 'df' simulates what I get when I read the mixed-type CSV file via dask
#

# Split the dataframe by the 'type' column
b_df = df.loc[df['type'] == 'B', :]
i_df = df.loc[df['type'] == 'I', :]

# Convert columns into our intended dtypes
b_df = to_dtypes(b_df, b_dtypes)
i_df = to_dtypes(i_df, i_dtypes)

# Let's convert our 'symbol' column to known categories
b_df = b_df.categorize(columns=['symbol'])
i_df['symbol'] = i_df['symbol'].cat.as_known()

# Is our symbol column known now?
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)

#
## print() returns 'False' for both, which is driving me up the wall.
## (Please help...)
#

UPDATE: It seems that if I set the 'npartitions' parameter to 1, then print() returns True in both cases. So this appears to be an issue with the partitions containing different categories. However, loading each dataframe into a single partition is not feasible, so is there a way I can tell dask to do some sort of re-sorting to make the categories consistent across partitions?

Upvotes: 2

Views: 750

Answers (1)

rpanai

Reputation: 13437

The answer to your problem is basically contained in the docs; I'm referring to the part of the example code commented with # categorize requires computation, and results in known categoricals. I'll expand on it here, because it seems to me you're misusing loc.

import pandas as pd
import dask.dataframe as dd

# Set up our test data
data = [['B', 'IBN', '9.9800'],
        ['B', 'PAY', '21.5000'],
        ['I', 'PAY', 'seventeen'],
        ['I', 'SPY', 'ten']
       ]

# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')

# Convert into dask
ddf = dd.from_pandas(pdf, npartitions=3)

# Split the dataframe by the 'type' column
# (reset_index is optional here; it just gives a clean index after filtering)
b_df = ddf[ddf["type"] == "B"].reset_index(drop=True)
i_df = ddf[ddf["type"] == "I"].reset_index(drop=True)

# Convert columns into our intended dtypes
b_df = b_df.categorize(columns=['symbol'])
b_df["price"] = b_df["price"].astype('float64')
i_df = i_df.categorize(columns=['symbol'])

# Is our symbol column known now? YES
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)

Upvotes: 1
