Reputation: 6896
I inherited a project that uses Dask DataFrame to build a dataframe.
from dask import dataframe as dd
# leaving out param values for brevity
df = dd.read_csv(
    's3://some-bucket/*.csv.gz',
    sep=delimiter,
    header=header,
    names=partition_column_names,
    compression=table_compression,
    encoding='utf-8',
    error_bad_lines=False,
    warn_bad_lines=True,
    parse_dates=date_columns,
    dtype=column_dtype,
    blocksize=None,
)
df_len = len(df)
# more stuff
I take that DataFrame, process it, and write it out as Parquet.
The process works fine, but occasionally (I still haven't identified the pattern) it just hangs on the len(df) call. No errors, no exit, nothing.
Is there any way to put a timeout on a Dask DataFrame operation? Or an option to turn on debugging/diagnostics to get better insight into what is happening?
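For context, a rough sketch of the kind of thing I'm hoping for, assuming a dask.distributed Client can be used here (Client, map_partitions, client.compute, and Future.result(timeout=...) are standard dask/distributed API, but whether this fits the existing pipeline, and the exact timeout exception raised, are assumptions on my part):

from __future__ import annotations

import asyncio

from dask.distributed import Client

client = Client()  # local cluster on this machine

# len(df) computes eagerly and blocks the calling thread, so there is nothing
# to time out on. A lazy equivalent can be submitted as a future instead.
row_count = df.map_partitions(len).sum()   # lazy scalar, same value as len(df)
future = client.compute(row_count)

try:
    df_len = future.result(timeout=600)    # seconds to wait before giving up
except asyncio.TimeoutError:               # exact exception type can vary by version
    future.cancel()
    raise RuntimeError("row count did not finish within 10 minutes")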
Upvotes: 1
Views: 208
Reputation: 28956
The diagnostics dashboard gives the most insight here. The distributed scheduler's dashboard (https://docs.dask.org/en/latest/diagnostics-distributed.html) has the richest information, but the local schedulers provide some diagnostics too (https://docs.dask.org/en/latest/diagnostics-local.html).
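A minimal sketch of turning both on, assuming everything runs on a single machine; Client, dashboard_link, ProgressBar, and Profiler are standard dask/distributed API, but the dashboard port and the surrounding pipeline are assumptions:

# Distributed scheduler: even a purely local Client exposes the dashboard,
# which shows the task stream, worker memory, and where a computation stalls.
from dask.distributed import Client

client = Client()                 # local cluster
print(client.dashboard_link)      # typically http://127.0.0.1:8787/status

# Local (threaded/multiprocess) scheduler: dask.diagnostics still gives
# progress and profiling information without a distributed cluster.
from dask.diagnostics import ProgressBar, Profiler

with ProgressBar(), Profiler() as prof:
    df_len = len(df)              # the call that hangs, now with visible progress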
Upvotes: 1