I'm trying to process a fairly large dataset that doesn't fit into memory when loaded all at once with Pandas, so I'm using Dask. However, I'm having difficulty adding a unique ID column to the dataset after reading it with the read_csv method; I keep getting an error (see code). I'm trying to create an index column so I can set that new column as the index for the data, but the error appears to be telling me to set the index first, before creating the column.
import dask.dataframe as dd
import numpy as np

df = dd.read_csv(r'path\to\file\file.csv') # File does not have a unique ID column, so I have to create one.
df['index_col'] = dd.from_array(np.arange(len(df))) # Trying to add an index column and fill it
# ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
Using range(1, len(df) + 1) instead changed the error to: TypeError: Column assignment doesn't support type range
Upvotes: 3
Views: 2191
Right: it's hard to know the number of lines in each chunk of a CSV file without reading through it, so it's hard to produce an index like 0, 1, 2, 3, ... if the dataset spans multiple partitions.
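You can see this directly on your dataframe: known_divisions is a standard Dask DataFrame attribute, and it will be False right after read_csv (the path below is a placeholder):

import dask.dataframe as dd

df = dd.read_csv('path/to/file.csv')  # placeholder path
print(df.known_divisions)  # False: Dask hasn't scanned the file, so partition boundaries are unknown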
One approach would be to create a column of ones:
df["idx"] = 1
and then call cumsum:
df["idx"] = df["idx"].cumsum()
But note that cumsum makes each partition depend on the ones before it, which adds a bunch of dependencies to the task graph that backs your dataframe, so some operations might not be as parallel as they were before.
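Putting it together, here is a minimal sketch (the path is a placeholder; passing sorted=True to set_index is one way to tell Dask the column is already monotonically increasing, which the cumulative sum guarantees, so it can skip a full shuffle):

import dask.dataframe as dd

df = dd.read_csv('path/to/file.csv')  # placeholder path

# A column of ones, cumulatively summed, yields 1, 2, 3, ... across partitions
df['idx'] = 1
df['idx'] = df['idx'].cumsum()

# cumsum output is monotonically increasing, so sorted=True avoids a shuffle
df = df.set_index('idx', sorted=True)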
Upvotes: 6