Reputation: 768
I'm trying to understand if BlazingSQL is a competitor or complementary to dask.
I have some medium-sized data (10-50GB) saved as parquet files on Azure blob storage.
IIUC I can query, join, aggregate, and group by with BlazingSQL using SQL syntax, but I can also read the data into cuDF using dask_cudf and do all the same operations using Python/dataframe syntax.
So, it seems to me that they're direct competitors?
Is it correct that one of the benefits of using Dask is that it operates on partitions, so it can handle datasets larger than GPU memory, whereas BlazingSQL is limited to what fits on the GPU?
Why would one choose to use BlazingSQL rather than dask?
Edit:
The docs talk about dask_cudf, but the actual repo is archived, saying that Dask support is now in cudf itself. It would be good to know how to leverage Dask to operate on larger-than-GPU-memory datasets with cuDF.
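For reference, the partition-at-a-time idea behind Dask can be sketched without any GPU libraries. Below is a minimal pure-Python sketch (hypothetical data and helper names; no cuDF or Dask involved) of a partition-wise groupby/sum that only ever touches one partition at a time:

```python
from collections import defaultdict

# Each "partition" stands in for one parquet file/row-group that
# dask_cudf would load onto the GPU one partition at a time.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
    [("a", 6), ("c", 7)],
]

def groupby_sum(parts):
    """Aggregate partition by partition, keeping only running totals."""
    totals = defaultdict(int)
    for part in parts:           # Dask would schedule these tasks in parallel
        for key, value in part:  # per-partition work (on-GPU in dask_cudf)
            totals[key] += value
    return dict(totals)

result = groupby_sum(partitions)
print(result)  # {'a': 10, 'b': 6, 'c': 12}
```

With dask_cudf the same shape of computation comes from `dask_cudf.read_parquet(...)` followed by `.groupby(...).sum()`: Dask splits the parquet data into partitions so each one fits in GPU memory, and combines the partial results.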
Upvotes: 3
Views: 442
Reputation: 66
Full disclosure I'm a co-founder of BlazingSQL.
BlazingSQL and Dask are not competitive; in fact, you need Dask to use BlazingSQL in a distributed context. All distributed BlazingSQL queries return dask_cudf result sets, so you can then continue operating on those results in Python/dataframe syntax. To your point, you are correct on both counts.
If you wish to make RAPIDS accessible to more users, SQL is a pretty easy onboarding path, and it's also much easier to optimize: the scope of SQL operations is far narrower than that of Dask, which has many other considerations.
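To illustrate the "SQL in, dataframes out" workflow without a GPU, here is a stand-in sketch using the stdlib `sqlite3` module for the SQL layer (BlazingSQL's actual entry points are `BlazingContext.create_table` and `BlazingContext.sql`, which return cuDF/dask_cudf objects rather than Python rows):

```python
import sqlite3

# Stand-in for BlazingSQL: do the heavy filtering/aggregation in SQL,
# then keep working on the result in Python/dataframe-style code.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (city TEXT, fare REAL)")
con.executemany("INSERT INTO trips VALUES (?, ?)",
                [("nyc", 10.0), ("nyc", 20.0), ("sf", 5.0)])

# SQL step (with BlazingSQL this would be bc.sql("SELECT ..."),
# returning a dask_cudf DataFrame in a distributed context)
rows = con.execute(
    "SELECT city, SUM(fare) FROM trips GROUP BY city ORDER BY city"
).fetchall()

# Dataframe-style step: continue on the result set in Python
totals = {city: fare for city, fare in rows}
print(totals)  # {'nyc': 30.0, 'sf': 5.0}
```

The point is the handoff: SQL expresses the query, and the result set is an ordinary object you continue to manipulate programmatically, which is exactly the interplay between BlazingSQL and dask_cudf described above.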
Upvotes: 5