chaitanyac3

Reputation: 1

Unable to load and compute a dask_cudf DataFrame into a BlazingSQL table; seeing memory-related errors (cudaErrorMemoryAllocation out of memory).

Issue:

Trying to load a file (CSV and Parquet) using dask_cudf and seeing memory-related errors. The dataset easily fits into memory, and the file can be read correctly using BlazingSQL's read_parquet method. However, the dask_cudf.read_parquet() method fails on the same file. The same error appears with both file formats.

Another observation: when a BlazingSQL table is created from a cuDF DataFrame, the table is created but contains zero records.

Any pointers on how to get past this issue would be helpful.

Dataset info:

- No. of rows: 126 million
- No. of columns: 209
- File format: Parquet
- No. of partitions: 8
- File size (Parquet): 400 MB
- File size (CSV): 62 GB

System info:

- GPU: 6 (Tesla V100)
- GPU memory: 16 GB
- Cores: 32

Client info:

- Scheduler: tcp://127.0.0.1:36617
- Dashboard: http://127.0.0.1:8787/status
- Cluster workers: 4
- Cores: 4
- Memory: 239.89 GiB

Code:

from blazingsql import BlazingContext
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask
import dask_cudf

cluster = LocalCUDACluster()
client = Client(cluster)
bc = BlazingContext(dask_client=client)

ddf = dask_cudf.read_parquet('/home/ubuntu/126M_dataset/')
bc.create_table('table', ddf.compute())

Error Message:

super(NumericalColumn, col).fillna(fill_value, method)
    501 
    502     def find_first_value(

~/miniconda3/lib/python3.7/site-packages/cudf/core/column/column.py in fillna(self, value, method, dtype)
    733         """
    734         return libcudf.replace.replace_nulls(
--> 735             input_col=self, replacement=value, method=method, dtype=dtype
    736         )
    737 

cudf/_lib/replace.pyx in cudf._lib.replace.replace_nulls()

cudf/_lib/scalar.pyx in cudf._lib.scalar.as_device_scalar()

~/miniconda3/lib/python3.7/site-packages/cudf/core/scalar.py in device_value(self)
     75         if self._device_value is None:
     76             self._device_value = DeviceScalar(
---> 77                 self._host_value, self._host_dtype
     78             )
     79         return self._device_value

cudf/_lib/scalar.pyx in cudf._lib.scalar.DeviceScalar.__init__()

cudf/_lib/scalar.pyx in cudf._lib.scalar.DeviceScalar._set_value()

cudf/_lib/scalar.pyx in cudf._lib.scalar._set_numeric_from_np_scalar()

MemoryError: std::bad_alloc: CUDA error at: /home/ubuntu/miniconda3/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory

System info:

nvidia-smi info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Tesla V1...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   49C    P0    55W / 300W |  16147MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA Tesla V1...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   48C    P0    56W / 300W |  16106MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA Tesla V1...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   46C    P0    61W / 300W |  16106MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA Tesla V1...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   48C    P0    60W / 300W |  16106MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    113949      C   ...ntu/miniconda3/bin/python      823MiB |
|    0   N/A  N/A    114055      C   ...ntu/miniconda3/bin/python    15319MiB |
|    1   N/A  N/A    114059      C   ...ntu/miniconda3/bin/python    16101MiB |
|    2   N/A  N/A    114062      C   ...ntu/miniconda3/bin/python    16101MiB |
|    3   N/A  N/A    114053      C   ...ntu/miniconda3/bin/python    16101MiB |
+-----------------------------------------------------------------------------+

Upvotes: 0

Views: 592

Answers (1)

Nick Becker

Reputation: 4214

- File size (Parquet): 400 MB
- File size (CSV): 62 GB
- GPU: 6 (Tesla V100), 16 GB memory, 32 cores

When you call compute on a Dask collection, it fully computes the result and brings it into the client process as a single-GPU object. Your data is likely overwhelming the 16 GB of memory on one of your GPUs. You are probably looking for persist, which fully computes the result and stores it in memory on the workers (note that execution happens in the background, so persist returns quickly).

You also shouldn't need to persist your data before creating a BlazingSQL table from a Dask object; the collection can be passed as-is.

You may find this answer, this blog post, and this documentation useful.

Upvotes: 1
