Reputation: 279
I am importing a very large CSV file (~680GB) using Dask; however, the output is not what I expect. My aim is to select only some columns (6 of 50), and perhaps filter them (this I am unsure of, because there seems to be no data?):
import dask.dataframe as dd
file_path = "/Volumes/Seagate/Work/Tickets/Third ticket/Extinction/species_all.csv"
cols = ['year', 'species', 'occurrenceStatus', 'individualCount', 'decimalLongitude', 'decimalLatitde']
dataset = dd.read_csv(file_path, names=cols, usecols=[9, 18, 19, 21, 22, 32])
When I read it into Jupyter, I cannot understand the output - the console outputs:
Dask DataFrame Structure:
                     year species occurrenceStatus individualCount decimalLongitude decimalLatitde
npartitions=11397
                   object  object           object          object           object         object
                      ...     ...              ...             ...              ...            ...
...                   ...     ...              ...             ...              ...            ...
                      ...     ...              ...             ...              ...            ...
Dask Name: read-csv, 11397 tasks
Upvotes: 3
Views: 2676
Reputation: 382
1. Lazy Computation
Dask evaluates lazily. Calling dataset alone doesn't trigger any computation. You'll need to call dataset.compute() or dataset.persist() to trigger computation and inspect the dataframe. The suggestion by the existing answer to use dataframe.head() is essentially calling .compute() on a subset of the data. Read more about what that means here in the Dask docs.
2. Column Pruning
You may want to consider converting your dataset to Parquet. From this resource: "Parquet lets you read specific columns from a dataset without reading the entire file. This is called column pruning and can be a massive performance improvement."
Toy code example
# read in your csv
dataset = dd.read_csv('your.csv')
# store as parquet
dataset.to_parquet('your.parquet', engine='pyarrow')
# read in parquet file with selected columns
dataset = dd.read_parquet('your.parquet', columns=list_of_columns)
dataset.head()
Upvotes: 0
Reputation: 16551
It looks like you've successfully created a dask dataframe. If you are expecting something like a pandas dataframe, then you can get a peek at the data with dataset.head(). For more involved computations it's best to keep the dataset lazy (as a dask dataframe), and use the standard pandas syntax for all transformations.
# dask is needed to call dask.compute
import dask

# for example, take a subset
subset_data = dataset[dataset['year'] > 2000]

# find the total value for this column (still lazy at this point)
lazy_result = subset_data['individualCount'].sum()

# now that the target is known, use .compute
# (note: dask.compute returns a tuple of results, so unpack it)
computed_result, = dask.compute(lazy_result)
Apart from dask, you can also look at vaex, which for some purposes might be better: https://vaex.io/
Upvotes: 1