holymonkey

Reputation: 21

Performance and data manipulation on Dask

I have imported a parquet file of approx. 800 MB with ~50 million rows into a Dask DataFrame. There are 5 columns: DATE, TICKER, COUNTRY, RETURN, GICS

Questions:

  1. How can I specify data types in read_parquet, or do I have to do it with astype?
  2. Can I parse dates within read_parquet?
  3. I simply tried to do the following:

    import dask.dataframe as dd
    df = dd.read_parquet(r'.\abc.gzip')
    df['INDUSTRY'] = df.GICS.str[0:4]
    n = df.INDUSTRY.unique().compute()
    

    and it takes forever to return. Am I doing anything wrong here? Partitions are automatically set to 1.

I am also trying to do something like df[df.INDUSTRY == '4010'].compute(); it likewise takes forever to return or crashes.

Upvotes: 2

Views: 176

Answers (1)

Benjamin Cohen

Reputation: 320

To answer your questions:

  1. A Parquet file has its data types stored in the file itself, as noted in the Apache docs here, so you won't be able to change the data types as you read the file in, meaning you'll have to use astype after the read.
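
    A minimal sketch of the astype route, using the column names from the question (the dtype choices here are only illustrative):

    import dask.dataframe as dd
    df = dd.read_parquet(r'.\abc.gzip')
    # Parquet fixes the stored types, so cast the columns after the read
    df = df.astype({'TICKER': 'category', 'COUNTRY': 'category', 'RETURN': 'float64'})
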
  2. You can't convert a string to a date within the read itself, though if you use the map_partitions function, documented here, you can convert the column to a date, as in this example:

    import dask.dataframe as dd
    import pandas as pd
    df = dd.read_parquet(your_file)
    meta = ('date', 'datetime64[ns]')
    # you can add your own date format, or just let pandas guess
    to_date_time = lambda x: pd.to_datetime(x, format='%Y-%m-%d')
    df['date_clean'] = df.date.map_partitions(to_date_time, meta=meta)
    

    The map_partitions function will convert the dates on each chunk of the parquet file when it is computed, making it functionally the same as converting the dates when the file is read in.
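
    If the lambda/meta boilerplate feels heavy, dask.dataframe also exposes a to_datetime wrapper that does the per-partition conversion for you; a sketch with the same assumed format:

    import dask.dataframe as dd
    df = dd.read_parquet(your_file)
    # kwargs are passed through to pandas.to_datetime on each partition
    df['date_clean'] = dd.to_datetime(df.date, format='%Y-%m-%d')
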

  3. Here again I think you would benefit from using the map_partitions function, so you might try something like this:

    import dask.dataframe as dd
    df = dd.read_parquet(r'.\abc.gzip')
    df['INDUSTRY'] = df.GICS.map_partitions(lambda x: x.str[0:4], meta=('INDUSTRY', 'str'))
    df[df.INDUSTRY == '4010']
    

    Note that if you run compute, the object is converted to pandas. If the file is too large, then Dask won't be able to compute it, and thus nothing will be returned. Without seeing the data it's hard to say more, but do check out these tools to profile your computations and see whether you are leveraging all your CPUs.
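
    Since the question notes that the file loads as a single partition, one more thing worth trying is repartitioning before computing so the work can actually spread across cores; the partition count below is just a placeholder to tune for your machine:

    import dask.dataframe as dd
    df = dd.read_parquet(r'.\abc.gzip')
    # one ~800 MB partition runs on a single core; splitting it lets the
    # string slicing and the filter run in parallel
    df = df.repartition(npartitions=8)
    df['INDUSTRY'] = df.GICS.map_partitions(lambda x: x.str[0:4], meta=('INDUSTRY', 'str'))
    subset = df[df.INDUSTRY == '4010'].compute()
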

Upvotes: 3
