Reputation: 21
I have imported a Parquet file of approx. 800MB with ~50 million rows into a Dask dataframe.
There are 5 columns: DATE, TICKER, COUNTRY, RETURN, GICS
Questions:
I simply tried to do the following:
import dask.dataframe as dd
df = dd.read_parquet(r'.\abc.gzip')
df['INDUSTRY'] = df.GICS.str[0:4]
n = df.INDUSTRY.unique().compute()
and it takes forever to return. Am I doing anything wrong here? Partitions are automatically set to 1.
I am also trying to do something like df[df.INDUSTRY == '4010'].compute(), and it likewise takes forever to return or crashes.
Upvotes: 2
Views: 176
Reputation: 320
To answer your questions:
You can't convert a string to a date within the read itself (astype won't help there), though if you use the map_partitions function, documented here, you can convert the column to dates, as in this example:
import dask.dataframe as dd
import pandas as pd

df = dd.read_parquet(your_file)
meta = ('date', 'datetime64[ns]')
# you can add your own date format, or just let pandas guess
to_date_time = lambda x: pd.to_datetime(x, format='%Y-%m-%d')
df['date_clean'] = df.date.map_partitions(to_date_time, meta=meta)
The map_partitions function will convert the dates on each chunk of the parquet when the file is computed, making it functionally the same as converting the dates when the file is read in.
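For example, assuming the df from the snippet above, the conversion stays lazy: the meta you passed already shows up in the dtypes, and a small head() call runs it on just the first partition (a sketch, not tested against your data):
import dask.dataframe as dd

print(df.dtypes)             # date_clean reports datetime64[ns] from the meta; nothing is computed yet
print(df.date_clean.head())  # computes only the first partition, handy for spot-checking the parsing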
Here again I think you would benefit from using the map_partitions function, so you might try something like this:
import dask.dataframe as dd
df = dd.read_parquet(r'.\abc.gzip')
df['INDUSTRY'] = df.GICS.map_partitions(lambda x: x.str[0:4], meta=('INDUSTRY', 'str'))
df[df.INDUSTRY == '4010']
Note that if you run compute the object is converted to pandas. If the file is too large, Dask won't be able to compute it, and thus nothing will be returned. Without seeing the data it's hard to say more, but do check out these tools to profile your computations and see whether you are leveraging all your CPUs.
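In case the link breaks: I'm assuming "these tools" refers to Dask's local diagnostics in dask.diagnostics. A minimal sketch, assuming the df from the snippet above and that bokeh is installed for the plot:
from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler, visualize

# Materialize only a small, filtered piece rather than the whole frame,
# and record task timings plus CPU/memory usage while it runs.
with ProgressBar(), Profiler() as prof, ResourceProfiler() as rprof:
    sample = df[df.INDUSTRY == '4010'].head()  # head() computes only as many partitions as it needs

visualize([prof, rprof])  # renders an interactive timeline of tasks and resource usage
With a single partition you won't see any parallelism in that profile, so something like df.repartition(npartitions=...) before the heavy work may also help, though that's a guess without seeing your setup.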
Upvotes: 3