Stackbeans

Reputation: 279

importing large CSV file using Dask

I am importing a very large CSV file (~680 GB) using Dask; however, the output is not what I expect. My aim is to select only some columns (6 of 50) and perhaps filter them (I am unsure about the filtering because there seems to be no data?):

import dask.dataframe as dd

file_path = "/Volumes/Seagate/Work/Tickets/Third ticket/Extinction/species_all.csv"

cols = ['year', 'species', 'occurrenceStatus', 'individualCount', 'decimalLongitude', 'decimalLatitde']
dataset = dd.read_csv(file_path, names=cols, usecols=[9, 18, 19, 21, 22, 32])

When I read it in Jupyter, I cannot understand the output; the console shows:

Dask DataFrame Structure:
                     year species occurrenceStatus individualCount decimalLongitude decimalLatitde
npartitions=11397                                                                                 
                   object  object           object          object           object         object
                      ...     ...              ...             ...              ...            ...
...                   ...     ...              ...             ...              ...            ...
                      ...     ...              ...             ...              ...            ...
                      ...     ...              ...             ...              ...            ...
Dask Name: read-csv, 11397 tasks

Upvotes: 3

Views: 2676

Answers (2)

avriiil

Reputation: 382

1. Lazy Computation

Dask evaluates lazily. Calling dataset alone doesn't trigger any computation. You'll need to call dataset.compute() or dataset.persist() to trigger computation and inspect the dataframe. The existing answer's suggestion to use dataset.head() is essentially calling .compute() on a small subset of the data. Read more about what that means in the Dask docs.
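
A minimal sketch of the difference (the file name here is just a placeholder):

import dask.dataframe as dd

dataset = dd.read_csv('your.csv')  # lazy: only builds a task graph, nothing is read yet

dataset.head()               # computes only the first rows of the first partition
result = dataset.compute()   # materialises the whole dataframe as pandas (needs enough RAM)
dataset = dataset.persist()  # runs the graph and keeps the result in memory, still a dask dataframe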

2. Column Pruning

You may want to consider converting your dataset to Parquet. From this resource: "Parquet lets you read specific columns from a dataset without reading the entire file. This is called column pruning and can be a massive performance improvement."

Toy code example

import dask.dataframe as dd

# read in your csv
dataset = dd.read_csv('your.csv')

# store as parquet
dataset.to_parquet('your.parquet', engine='pyarrow')

# read in the parquet file, loading only the columns you need
# (list_of_columns is a placeholder for your selected column names)
dataset = dd.read_parquet('your.parquet', columns=list_of_columns)
dataset.head()

Upvotes: 0

SultanOrazbayev

Reputation: 16551

It looks like you've successfully created a dask dataframe. If you are expecting something like a pandas dataframe, then you can get a peek at the data with dataset.head(). For more involved computations it's best to keep the dataset lazy (as a dask dataframe), and use the standard pandas syntax for all transformations.

# this is needed to call dask.compute
import dask

# for example, take a subset of the rows
subset_data = dataset[dataset['year'] > 2000]

# build the (still lazy) total for this column
lazy_result = subset_data['individualCount'].sum()

# now that the target is known, use .compute
# note that dask.compute returns a tuple of results, hence the unpacking
(computed_result,) = dask.compute(lazy_result)

Apart from dask, you can also look at vaex, which for some purposes might be better: https://vaex.io/
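
A minimal vaex sketch, assuming the same column names as in the question (the file name, threshold and chunk size are placeholders; convert=True writes a memory-mappable copy next to the CSV on first read):

import vaex

# first read converts the CSV in chunks to a memory-mappable file for fast re-opening
df = vaex.from_csv('your.csv', convert=True, chunk_size=5_000_000)

# the filter is lazy; the sum executes out-of-core and returns a number
subset = df[df.year > 2000]
total = subset.individualCount.sum()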

Upvotes: 1
