shantanuo

Reputation: 32218

dask dataframe issue with string conversion

I can easily convert a string column to dates in pandas, as shown here:

df.date = pd.to_datetime(df.date, format="%m/%d/%Y")

Is there no easy way to do the same in dask?

Here is the pandas example that works with dates:

import pandas as pd

url="http://web.mta.info/developers/data/nyct/turnstile/turnstile_170128.txt"
df=pd.read_csv(url)

df.info()

df.columns=['ca', 'unit', 'scp', 'station', 'inename', 'division', 'date', 'time', 'desc', 'entries', 'exits']

df.date = pd.to_datetime(df.date, format="%m/%d/%Y")

And here is the dask version, which loads the data but leaves the date column as a string:

link = 'http://web.mta.info/developers/'

data = ['data/nyct/turnstile/turnstile_170128.txt',
        'data/nyct/turnstile/turnstile_170121.txt',
        'data/nyct/turnstile/turnstile_170114.txt',
        'data/nyct/turnstile/turnstile_170107.txt']

urls=[]
for i in data:
    urls.append(link+i)

import pandas as pd
import dask
import dask.dataframe as dd

ddfs = [dask.delayed(pd.read_csv)(url) for url in urls]

ddf = dd.from_delayed(ddfs)

ddf.columns=['ca', 'unit', 'scp', 'station', 'inename', 'division', 'date', 'time', 'desc', 'entries', 'exits']

How do I convert the string column to a date in dask?

Upvotes: 2

Views: 1976

Answers (1)

MRocklin

Reputation: 57319

Edit

This has since been added to dask.dataframe:

dd.to_datetime(...)

Previous answer

Do this with the parse_dates= keyword of pd.read_csv:

ddfs = [dask.delayed(pd.read_csv)(url, parse_dates=['DATE']) for url in urls]

Or you can even combine the DATE and TIME columns of your original data into a single column:

ddfs = [dask.delayed(pd.read_csv)(url, parse_dates={'DATETIME': ['DATE', 'TIME']}) for url in urls]

Use map_partitions

If you have a dataframe with an object dtype column, you can always use map_partitions to apply a pandas function to every partition. You should also tell map_partitions the expected name and dtype of the output through the meta= keyword.

ddf['date'] = ddf['date'].map_partitions(pd.to_datetime, format='%m/%d/%Y',
                                         meta=('date', 'M8[ns]'))

This is generally a good way to cover Pandas functionality for which there is no dask.dataframe API.

Upvotes: 3
