Reputation: 622
When using Python Pandas to read a CSV it is possible to specify the index column. Is this possible using Python Dask when reading the file, as opposed to setting the index afterwards?
For example, using pandas:
df = pandas.read_csv(filename, index_col=0)
Ideally, with Dask this could be:
df = dask.dataframe.read_csv(filename, index_col=0)
I have tried
df = dask.dataframe.read_csv(filename).set_index(?)
but the index column does not have a name (and this seems slow).
Upvotes: 12
Views: 21686
Reputation: 21
Now you can write: df = pandas.read_csv(filename, index_col='column_name')
(where column_name is the name of the column you want to set as the index).
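As a small illustration of the pandas call above (pandas only, not Dask), the sketch below reads an in-memory CSV and selects the index column by name and by position. The column names `id` and `value` and the sample data are made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content; 'id' and 'value' are illustrative column names.
csv_text = "id,value\na,1\nb,2\n"

# Select the index column by name...
by_name = pd.read_csv(io.StringIO(csv_text), index_col="id")

# ...or by position, as in the question.
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(by_name)
```

Both calls produce the same frame here, because `id` is the first column.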
Upvotes: 2
Reputation: 166
I know I'm a bit late, but this is the first result on Google, so it should get answered.
If you write your dataframe with:
# index=True is the default
my_pandas_df.to_csv('path')
# so this is equivalent
my_pandas_df.to_csv('path', index=True)
And import with Dask:
import dask.dataframe as dd
my_dask_df = dd.read_csv('path').set_index('Unnamed: 0')
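It will use column 0 as your index. That column has no header because pandas.DataFrame.to_csv() writes an unnamed index with an empty header, so it is read back as 'Unnamed: 0'. You can check the column names with: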
my_dask_df = dd.read_csv('path')
my_dask_df.columns
which returns
Index(['Unnamed: 0', 'col 0', 'col 1',
...
'col n'],
dtype='object', length=...)
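To see where the 'Unnamed: 0' name comes from, here is a minimal pandas-only sketch of the round trip, using an in-memory buffer instead of a file. The frame contents are made up; the same column name is what you would then pass to set_index in Dask.

```python
import io

import pandas as pd

# Illustrative frame with a default, unnamed integer index.
df = pd.DataFrame({"col 0": [1, 2], "col 1": [3, 4]})

# index=True is the default, so the index is written as a first column
# whose header cell is empty.
buf = io.StringIO()
df.to_csv(buf)
csv_text = buf.getvalue()
header = csv_text.splitlines()[0]

# Reading the text back without index_col gives that column a placeholder name.
reread = pd.read_csv(io.StringIO(csv_text))
first_column = reread.columns[0]

# Setting the placeholder column as the index restores the original index values.
restored = reread.set_index(first_column)

print(repr(header))
print(first_column)
```

If the index had a name when the file was written, that name would appear in the header instead, and you would pass it to set_index.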
Upvotes: 2
Reputation: 57281
No, reading the file and setting the index need to be two separate method calls. If you try to pass the index to read_csv, Dask will tell you so in a nice error message.
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('*.csv', index='my-index')
ValueError: Keyword 'index' not supported dd.read_csv(...).set_index('my-index') instead
But this won't be any slower or faster than doing it the other way.
Upvotes: 8