agp

Reputation: 31

Why is cuml predict() method for KNearestNeighbors taking so long with dask_cudf DataFrame?

I have a large dataset (around 80 million rows) and I am training a cuml KNeighborsRegressor model on a dask_cudf DataFrame.

I am using 4 GPUs with an rmm_pool_size of 15GB each:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import cudf, cuml
import dask_cudf

cluster = LocalCUDACluster(
    rmm_pool_size="15GB"
)

client = Client(cluster)
client.run(cudf.set_allocator, "managed")

I am reading the data from a parquet file stored in an S3 bucket:

df = dask_cudf.read_parquet("s3://path-to-parquet/", chunksize="2 GB", dtype=dtypes)

When I fit the KNN model this runs fine and I can see the GPU utilization is high during this time. This is the code I used to fit the model:

from cuml.dask.neighbors import KNeighborsRegressor
from dask_ml.model_selection import train_test_split    

target = "target_lat"
X = train_df.drop(columns=target)
y = train_df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

model = KNeighborsRegressor(n_neighbors=5, client=client)
model.fit(X_train, y_train)

However when I try to output the predictions for the test set, this takes a huge amount of time compared to the fit method.

predictions = model.predict(X_test)

On one occasion I waited almost 24 hours to finally see the results of the predict method. It was also clear that GPU utilization was much lower while predict was running: it dropped to approximately 30-40%, compared to ~100% during training. See the screenshot below:

[Screenshot: CPU & GPU utilization during the predict method]

I could use some help in understanding why the predict method is taking so long and if I've done something wrong in my code. For reference, I am following the KNN Regressor example given on this documentation site: https://docs.rapids.ai/api/cuml/stable/api.html#id23

Any help would be greatly appreciated, thank you!

Upvotes: 1

Views: 356

Answers (1)

vlaf

Reputation: 11

The documentation for the distributed version of KNN Regressor can be found here.

Here are a few rules to follow to get optimal performance:

  1. The index (X_train and y_train) should be composed of large partitions distributed in a balanced way on the workers.

  2. The query (X_test) should ideally be composed of partitions whose number of samples is a multiple of the batch_size parameter. Their distribution across the workers does not matter.

  3. The batch_size parameter, which sets how many query rows are processed at once, can be raised from its default to improve throughput.
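Applied to the code in the question, the three rules above might look like the sketch below. This is a hedged, untested sketch: the helper `query_npartitions`, the value `batch_size=4096`, and the choice of 8 batches per partition are illustrative assumptions, not tuned recommendations, and the guarded section assumes the `client`, `X_train`, `y_train`, and `X_test` objects from the question.

```python
import math

def query_npartitions(n_rows, batch_size, batches_per_partition=8):
    """Illustrative helper: number of query partitions so that each
    partition holds roughly `batches_per_partition` full batches,
    keeping partition sizes near a multiple of batch_size (rule 2)."""
    return max(1, math.ceil(n_rows / (batch_size * batches_per_partition)))

if __name__ == "__main__":
    # GPU-only section: assumes a running LocalCUDACluster `client` and
    # the X_train / y_train / X_test dask_cudf objects from the question.
    from cuml.dask.neighbors import KNeighborsRegressor

    # Rule 1: large, balanced index partitions -- e.g. one per GPU worker.
    n_workers = len(client.scheduler_info()["workers"])
    X_train = X_train.repartition(npartitions=n_workers).persist()
    y_train = y_train.repartition(npartitions=n_workers).persist()

    # Rule 3: raise batch_size so more queries are processed at once.
    batch_size = 4096  # illustrative value, not a recommendation
    model = KNeighborsRegressor(n_neighbors=5, batch_size=batch_size,
                                client=client)
    model.fit(X_train, y_train)

    # Rule 2: size query partitions to a multiple of batch_size.
    nparts = query_npartitions(len(X_test), batch_size)
    X_test = X_test.repartition(npartitions=nparts).persist()

    predictions = model.predict(X_test)
```

The repartitioning is done once up front and persisted, so the cost is paid before `predict` rather than inside it.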

Hope that it's helpful!

Upvotes: 1
