Reputation: 31
I have a large dataset (around 80 million rows) and I am training a K-Nearest Neighbors regression model using cuML with a dask_cudf DataFrame.
I am using 4 GPUs with an rmm_pool_size of 15 GB each:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import cudf, cuml
import dask_cudf
cluster = LocalCUDACluster(
    rmm_pool_size="15GB"
)
client = Client(cluster)
client.run(cudf.set_allocator, "managed")
I am reading the data from a parquet file stored in an S3 bucket:
df = dask_cudf.read_parquet("s3://path-to-parquet/", chunksize="2 GB", dtype=dtypes)
Fitting the KNN model runs fine, and I can see that GPU utilization is high while it runs. This is the code I used to fit the model:
from cuml.dask.neighbors import KNeighborsRegressor
from dask_ml.model_selection import train_test_split
target = "target_lat"
X = train_df.drop(columns=target)
y = train_df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
model = KNeighborsRegressor(n_neighbors=5, client=client)
model.fit(X_train, y_train)
However, when I try to output predictions for the test set, it takes a huge amount of time compared to the fit method:
predictions = model.predict(X_test)
On one occasion I waited almost 24 hours to finally see the results of the predict method. It was also clear that GPU utilization was much lower while predict was running: it dropped to approximately 30-40%, compared to ~100% during training (see the screenshot below).
I could use some help understanding why the predict method takes so long and whether I've done something wrong in my code. For reference, I am following the KNN Regressor example on this documentation site: https://docs.rapids.ai/api/cuml/stable/api.html#id23
Any help would be greatly appreciated, thank you!
Upvotes: 1
Views: 356
Reputation: 11
The documentation for the distributed version of KNN Regressor can be found here.
Here are a few rules to follow to get optimal performance:
The index (X_train and y_train) should be composed of large partitions distributed in a balanced way across the workers.
The query (X_test) should ideally be composed of partitions whose number of samples is a multiple of the batch_size parameter. How the query partitions are distributed across the workers does not matter.
The batch_size parameter, which sets how many query samples are processed at once, can be set to a higher value (see the sketch after this list).
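To make these rules concrete, here is a minimal sketch of how the snippet from the question could be adjusted. It assumes that batch_size is accepted by the KNeighborsRegressor constructor (the parameter described above) and reuses the X_train/y_train/X_test variables and client from the question; the worker count, batch size, and partition counts are illustrative placeholders, not tuned recommendations.
from cuml.dask.neighbors import KNeighborsRegressor

# Sketch only: the values below are placeholders, not tuned recommendations.
n_workers = 4  # one Dask-CUDA worker per GPU, as in the cluster setup above

# Rule 1: a few large, evenly sized index partitions, balanced across workers
X_train = X_train.repartition(npartitions=n_workers)
y_train = y_train.repartition(npartitions=n_workers)
X_train, y_train = client.persist([X_train, y_train])

# Rule 3: process more query samples per round (assumes the constructor
# accepts a batch_size keyword, per the parameter described above)
batch_size = 65536

# Rule 2: size query partitions as a multiple of batch_size
# (approximate: repartition by row count so each partition holds ~4 batches)
rows_per_partition = 4 * batch_size
n_query_rows = len(X_test)
X_test = X_test.repartition(npartitions=max(1, n_query_rows // rows_per_partition))

model = KNeighborsRegressor(n_neighbors=5, batch_size=batch_size, client=client)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
The idea is that a handful of large, balanced index partitions keeps the distributed index evenly spread over the workers, while sizing query partitions as multiples of batch_size avoids small leftover batches and under-utilized GPUs on each query round.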
Hope that's helpful!
Upvotes: 1