Reputation: 31
I have a large dataset (around 80 million rows) and I am training a K-Nearest Neighbors regression model using cuML with a dask_cudf DataFrame.
I am using 4 GPUs with an rmm_pool_size of 15 GB each:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import cudf, cuml
import dask_cudf
cluster = LocalCUDACluster(
    rmm_pool_size="15GB"
)
client = Client(cluster)
client.run(cudf.set_allocator, "managed")
I am reading the data from a parquet file stored in an S3 bucket:
df = dask_cudf.read_parquet("s3://path-to-parquet/", chunksize="2 GB", dtype=dtypes)
Fitting the KNN model runs fine, and I can see that GPU utilization is high while it runs. This is the code I used to fit the model:
from cuml.dask.neighbors import KNeighborsRegressor
from dask_ml.model_selection import train_test_split
target = "target_lat"
X = train_df.drop(columns=target)
y = train_df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
model = KNeighborsRegressor(n_neighbors=5, client=client)
model.fit(X_train, y_train)
However, when I try to output predictions for the test set, it takes a huge amount of time compared to the fit method:
predictions = model.predict(X_test)
On one occasion I waited almost 24 hours to finally see the results of the predict method. It was also clear that GPU utilization was much lower while predict was running: it dropped to approximately 30-40%, compared to ~100% during training (see the screenshot below).
I could use some help understanding why the predict method takes so long and whether I've done something wrong in my code. For reference, I am following the KNN Regressor example on this documentation site: https://docs.rapids.ai/api/cuml/stable/api.html#id23
Any help would be greatly appreciated, thank you!
Upvotes: 1
Views: 356
Reputation: 11
The documentation for the distributed version of KNN Regressor can be found here.
Here are a few rules to follow to get optimal performance:
The index (X_train and y_train) should be composed of large partitions distributed in a balanced way across the workers.
The query (X_test) should ideally be composed of partitions whose number of samples is a multiple of the batch_size parameter. How the query partitions are distributed across the workers does not matter.
The batch_size parameter, which sets how many query samples are processed at once, can be set to a higher value (see the sketch after this list).
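To make these rules concrete, here is a minimal sketch of how the snippet from the question could be adjusted. It assumes that batch_size is accepted by the KNeighborsRegressor constructor (the parameter described above) and reuses the X_train/y_train/X_test variables and client from the question; the worker count, batch size, and partition counts are illustrative placeholders, not tuned recommendations.
from cuml.dask.neighbors import KNeighborsRegressor

# Sketch only: the values below are placeholders, not tuned recommendations.
n_workers = 4  # one Dask-CUDA worker per GPU, as in the cluster setup above

# Rule 1: a few large, evenly sized index partitions, balanced across workers
X_train = X_train.repartition(npartitions=n_workers)
y_train = y_train.repartition(npartitions=n_workers)
X_train, y_train = client.persist([X_train, y_train])

# Rule 3: process more query samples per round (assumes the constructor
# accepts a batch_size keyword, per the parameter described above)
batch_size = 65536

# Rule 2: size query partitions as a multiple of batch_size
# (approximate: repartition by row count so each partition holds ~4 batches)
rows_per_partition = 4 * batch_size
n_query_rows = len(X_test)
X_test = X_test.repartition(npartitions=max(1, n_query_rows // rows_per_partition))

model = KNeighborsRegressor(n_neighbors=5, batch_size=batch_size, client=client)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
The idea is that a handful of large, balanced index partitions keeps the distributed index evenly spread over the workers, while sizing query partitions as multiples of batch_size avoids small leftover batches and under-utilized GPUs on each query round.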
Hope that's helpful!
Upvotes: 1