ShockDoctor

Reputation: 683

What is the way to add an index column in Dask when reading from a CSV?

I'm trying to process a fairly large dataset that doesn't fit into memory when loaded at once with Pandas, so I'm using Dask. However, after reading the file with the read_csv method, I'm having difficulty adding a unique ID column to the dataset. I keep getting an error (see Code). I'm trying to create a new column so I can set it as the index for the data, but the error appears to be telling me to set the index first, before creating the column.

CODE

import dask.dataframe as dd
import numpy as np

df = dd.read_csv(r'path\to\file\file.csv')  # File does not have a unique ID column, so I have to create one.
df['index_col'] = dd.from_array(np.arange(len(df)))  # Trying to add an index column and fill it
# ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

Update

Using range(1, len(df) + 1) changed the error to: TypeError: Column assignment doesn't support type range

Upvotes: 3

Views: 2191

Answers (1)

MRocklin

Reputation: 57281

Right, it's hard to know the number of lines in each chunk of a CSV file without reading through it, so it's hard to produce an index like 0, 1, 2, 3, ... when the dataset spans multiple partitions.

One approach would be to create a column of ones:

df["idx"] = 1

and then call cumsum

df["idx"] = df["idx"].cumsum()

But note that this does add a bunch of dependencies to the task graph that backs your dataframe, so some operations might not be as parallel as they were before.

Upvotes: 6
