Jaydog

Reputation: 622

Can I set the index column when reading a CSV using Python dask?

When reading a CSV with pandas, you can specify the index column. Is this possible with Dask at read time, as opposed to setting the index afterwards?

For example, using pandas:

df = pandas.read_csv(filename, index_col=0)

Ideally, the Dask equivalent would be:

df = dask.dataframe.read_csv(filename, index_col=0)

I have tried

df = dask.dataframe.read_csv(filename).set_index(?)

but the index column does not have a name (and this seems slow).

Upvotes: 12

Views: 21686

Answers (3)

Sunil

Reputation: 21

Now you can write: df = pandas.read_csv(filename, index_col='column_name'), where column_name is the name of the column you want to set as the index.

Upvotes: 2

E. Bassett

Reputation: 166

I know I'm a bit late, but this is the first result on Google, so it should get answered.

If you write your dataframe with:

# index = True is default
my_pandas_df.to_csv('path')

# so this is the same
my_pandas_df.to_csv('path', index=True)

And import with Dask:

import dask.dataframe as dd
my_dask_df = dd.read_csv('path').set_index('Unnamed: 0')

This uses column 0 as your index. That column is labelled 'Unnamed: 0' because pandas.DataFrame.to_csv() writes an unnamed index under an empty header.

How to figure it out:

my_dask_df = dd.read_csv('path')
my_dask_df.columns

which returns

Index(['Unnamed: 0', 'col 0', 'col 1',
       ...
       'col n'],
      dtype='object', length=...)

Upvotes: 2

MRocklin

Reputation: 57281

No, these need to be two separate calls. If you try to do it in one, Dask will tell you so in a nice error message.

In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('*.csv', index='my-index')
ValueError: Keyword 'index' not supported dd.read_csv(...).set_index('my-index') instead

But this won't be any slower or faster than doing it the other way.

Upvotes: 8
