maafk

Reputation: 6896

Debug why Dask Dataframe operation is doing nothing

I inherited a project using Dask Dataframe to create a dataframe.

from dask import dataframe as dd

# leaving out param values for brevity

df = dd.read_csv(
    's3://some-bucket/*.csv.gz',
    sep=delimiter,
    header=header,
    names=partition_column_names,
    compression=table_compression,
    encoding='utf-8',
    error_bad_lines=False,
    warn_bad_lines=True,
    parse_dates=date_columns,
    dtype=column_dtype,
    blocksize=None,
)

df_len = len(df)

# more stuff

I take that Dataframe, process it, and turn it into Parquet.

The process usually works fine, but occasionally (I still haven't identified a pattern) it just hangs on the len(df) call. No errors, no exit, nothing.

Is there any way with Dask DataFrames to set a timeout on a DataFrame operation? Or perhaps an option to turn on debugging so I can get better insight into what is happening?

Upvotes: 1

Views: 208

Answers (1)

TomAugspurger

Reputation: 28956

The diagnostics dashboard provides the most information here. The distributed scheduler's dashboard (https://docs.dask.org/en/latest/diagnostics-distributed.html) has the richest information, but the local schedulers provide some diagnostics too (https://docs.dask.org/en/latest/diagnostics-local.html).
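
A minimal sketch of both options (assuming df is the DataFrame built by the read_csv call in the question; Client() here just starts a local cluster for illustration, not a definitive setup for your deployment):

from dask.diagnostics import ProgressBar
from dask.distributed import Client

# Local schedulers: the ProgressBar context manager prints progress for
# each compute call, which at least shows whether len(df) is moving or
# truly stuck.
with ProgressBar():
    df_len = len(df)

# Distributed scheduler: Client() starts a local cluster and exposes the
# dashboard (task stream, progress, worker memory) at client.dashboard_link.
client = Client()
print(client.dashboard_link)  # open this in a browser while the computation runs
df_len = len(df)

Note that ProgressBar only reports when you are on the local schedulers; once a Client is created, the distributed scheduler (and its dashboard) takes over.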

Upvotes: 1
