Reputation: 880
I would like to scale some operations that I currently run on a pandas DataFrame by using dask 2.14. For example, I would like to apply a shift to a column of a dataframe:
import dask.dataframe as dd
data = dd.read_csv('some_file.csv')
data.set_index('column_A')
data['column_B'] = data.groupby(['column_A'])['column_B'].shift(-1)
but I get AttributeError: 'SeriesGroupBy' object has no attribute 'shift'.
I read the dask documentation and saw that there is no such method (although pandas has one).
Can you suggest a valid alternative?
Thank you
Upvotes: 5
Views: 1524
Reputation: 13175
There is an open ticket about this on GitHub. Essentially, you will have to use apply to get around it. I'm not sure whether this carries performance implications in dask. There is a further ticket referencing the issue and stating that it lies in pandas, but it's been open for some time.
This should be equivalent to the pandas operation:
import dask.dataframe as dd
import pandas as pd
import random
df = pd.DataFrame({'a': list(range(10)),
                   'b': random.choices(['x', 'y'], k=10)})
print("####### PANDAS ######")
print("Initial df")
print(df.head(10))
print("................")
pandas_df = df.copy()
print("Final df")
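# group_keys=False keeps the original row index so the result aligns on assignment (needed on pandas >= 2.0)
pandas_df['a'] = pandas_df.groupby(['b'], group_keys=False)['a'].apply(lambda x: x.shift(-1))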
print(pandas_df.head(10))
print()
print("####### DASK ######")
print("Initial df")
dask_df = dd.from_pandas(df, npartitions=1).reset_index()
print(dask_df.head(10))
print("................")
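# meta declares the output name and dtype up front, so dask does not have to infer them
dask_df['a'] = dask_df.groupby(['b'])['a'].apply(lambda x: x.shift(-1), meta=('a', 'f8'))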
print("Final df")
print(dask_df.head(10))
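As a sanity check on the equivalence claim, here is a minimal pandas-only sketch that compares the apply workaround against the native grouped shift. The seed, the frame size, and the variable names (native, via_apply, same) are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; column names mirror the example above.
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': np.arange(20),
                   'b': rng.choice(['x', 'y'], size=20)})

# Native grouped shift (available in pandas, missing from dask's SeriesGroupBy).
native = df.groupby('b')['a'].shift(-1)

# Workaround via apply; group_keys=False keeps the original row index.
via_apply = df.groupby('b', group_keys=False)['a'].apply(lambda s: s.shift(-1))

# Sort by index before comparing, in case apply returns rows grouped by key.
same = native.sort_index().equals(via_apply.sort_index())
print(same)
```

If the two Series match, the last row of each group should be NaN in both, because there is no following row inside that group to shift from.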
I obviously can't benchmark the approach in dask since there seems to be no alternative. However, I can in pandas:
import string
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': list(range(100000)),
                   'b': np.random.choice(list(string.ascii_lowercase), 100000)})
def normal_way(df):
    # Native grouped shift
    return df.groupby(['b'])['a'].shift(-1)
def apply_way(df):
    # Same shift routed through apply
    return df.groupby(['b'])['a'].apply(lambda x: x.shift(-1))
The timeit results are:
%timeit normal_way(df)
4.25 ms ± 98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit apply_way(df)
15 ms ± 446 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
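The %timeit magic only works inside IPython or Jupyter. To reproduce the comparison from a plain script, something like the following sketch with the standard-library timeit module could be used. The helper best_ms and its repeat/number settings are assumptions made for illustration. It reports the best run rather than the mean ± std. dev. shown above, so the figures will not match exactly and will vary by machine and pandas version.

```python
import string
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(100_000),
                   'b': np.random.choice(list(string.ascii_lowercase), 100_000)})

def normal_way(frame):
    # Native grouped shift
    return frame.groupby('b')['a'].shift(-1)

def apply_way(frame):
    # Same shift routed through apply
    return frame.groupby('b', group_keys=False)['a'].apply(lambda s: s.shift(-1))

def best_ms(func, repeat=5, number=10):
    # Hypothetical helper: best average time per call, in milliseconds.
    times = timeit.repeat(lambda: func(df), repeat=repeat, number=number)
    return min(times) / number * 1000

normal_ms = best_ms(normal_way)
apply_ms = best_ms(apply_way)
print(f"native shift: {normal_ms:.2f} ms per call")
print(f"apply shift:  {apply_ms:.2f} ms per call")
```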
Upvotes: 8