holymonkey

Reputation: 21

Performance and data manipulation on Dask

I have imported a parquet file of approx. 800 MB with ~50 million rows into a Dask DataFrame. There are 5 columns: DATE, TICKER, COUNTRY, RETURN, GICS

Questions:

  1. How can I specify data types in read_parquet, or do I have to do it with astype?
  2. Can I parse dates within read_parquet?
  3. I simply tried to do the following:

    import dask.dataframe as dd
    df = dd.read_parquet(r'.\abc.gzip')
    df['INDUSTRY'] = df.GICS.str[0:4]
    n = df.INDUSTRY.unique().compute()
    

    and it takes forever to return. Am I doing anything wrong here? Partitions are automatically set to 1.

I am also trying to do something like df[df.INDUSTRY == '4010'].compute(); it likewise takes forever to return or crashes.

Upvotes: 2

Views: 176

Answers (1)

Benjamin Cohen

Reputation: 320

To answer your questions:

  1. A Parquet file has its data types stored in the file itself, as noted in the Apache docs here, so you won't be able to change the data types as you read the file in, meaning you'll have to use astype after the read.
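
    A minimal sketch of the astype route, using the column names from the question (the dtype choices here are only illustrative):

    import dask.dataframe as dd
    df = dd.read_parquet(r'.\abc.gzip')
    # Parquet fixes the stored types, so cast the columns after the read
    df = df.astype({'TICKER': 'category', 'COUNTRY': 'category', 'RETURN': 'float64'})
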
  2. You can't convert a string to a date within the read itself, though if you use the map_partitions function, documented here, you can convert the column to a date, as in this example:

    import dask.dataframe as dd
    import pandas as pd
    df = dd.read_parquet(your_file)
    meta = ('date', 'datetime64[ns]')
    # you can add your own date format, or just let pandas guess
    to_date_time = lambda x: pd.to_datetime(x, format='%Y-%m-%d')
    df['date_clean'] = df.date.map_partitions(to_date_time, meta=meta)
    

    The map_partitions function will convert the dates on each chunk of the parquet file when it is computed, making it functionally the same as converting the dates when the file is read in.
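
    If the lambda/meta boilerplate feels heavy, dask.dataframe also exposes a to_datetime wrapper that does the per-partition conversion for you; a sketch with the same assumed format:

    import dask.dataframe as dd
    df = dd.read_parquet(your_file)
    # kwargs are passed through to pandas.to_datetime on each partition
    df['date_clean'] = dd.to_datetime(df.date, format='%Y-%m-%d')
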

  3. Here again I think you would benefit from using the map_partitions function, so you might try something like this:

    import dask.dataframe as dd
    df = dd.read_parquet(r'.\abc.gzip')
    df['INDUSTRY'] = df.GICS.map_partitions(lambda x: x.str[0:4], meta=('INDUSTRY', 'str'))
    df[df.INDUSTRY == '4010']
    

    Note that if you run compute, the object is converted to pandas. If the file is too large, then Dask won't be able to compute it, and thus nothing will be returned. Without seeing the data it's hard to say more, but do check out these tools to profile your computations and see whether you are leveraging all your CPUs.
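
    Since the question notes that the file loads as a single partition, one more thing worth trying is repartitioning before computing so the work can actually spread across cores; the partition count below is just a placeholder to tune for your machine:

    import dask.dataframe as dd
    df = dd.read_parquet(r'.\abc.gzip')
    # one ~800 MB partition runs on a single core; splitting it lets the
    # string slicing and the filter run in parallel
    df = df.repartition(npartitions=8)
    df['INDUSTRY'] = df.GICS.map_partitions(lambda x: x.str[0:4], meta=('INDUSTRY', 'str'))
    subset = df[df.INDUSTRY == '4010'].compute()
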

Upvotes: 3
