Reputation: 665
I have the following DataFrame (in reality I'm working with around 20 million rows):
shop month day sale
1 7 1 10
1 6 1 8
1 5 1 9
2 7 1 10
2 6 1 8
2 5 1 9
I want another column: "Prev month sales", where sales are equal to the "Sales of previous month with same day, e.g.
shop month day sale prev month sale
1 7 1 10 8
1 6 1 8 9
1 5 1 9 9
2 7 1 10 8
2 6 1 8 9
2 5 1 9 9
Upvotes: 2
Views: 399
Reputation: 40878
One solution using .concat()
, set_index()
, and .loc[]
:
# Get index of (shop, previous month, day).
# This will serve as a unique index to look up prev. month sale.
prev = pd.concat((df.shop, df.month - 1, df.day), axis=1)
# Unfortunately need to convert to list of tuples for MultiIndexing
prev = pd.MultiIndex.from_arrays(prev.values.T)
# old: [tuple(i) for i in prev.values]
# Now call .loc on df to look up each prev. month sale.
sale_prev_month = df.set_index(['shop', 'month', 'day']).loc[prev]
# And finally just concat rather than merge/join operation
# because we want to ignore index & mimic a left join.
df = pd.concat((df, sale_prev_month.reset_index(drop=True)), axis=1)
shop month day sale sale
0 1 7 1 10 8.0
1 1 6 1 8 9.0
2 1 5 1 9 NaN
3 2 7 1 10 8.0
4 2 6 1 8 9.0
5 2 5 1 9 NaN
Your new column will be float, not int, because of the presence of NaNs.
I don't use dask day to day so this is probably woefully sub-par. Trying to work around the fact that dask does not implement pandas' MultiIndex. So, you can concatenate your three existing indices into a string column and lookup on that.
import dask.dataframe as dd
# Play around with npartitions or chunksize here!
df2 = dd.from_pandas(df, npartitions=10)
# Get a *single* index of unique (shop, month, day IDs)
# Dask doesn't support MultiIndex
empty = pd.Series(np.empty(len(df), dtype='object')) # Passed to `meta`
current = df2.loc[:, on].apply(lambda col: '_'.join(col.astype(str)), axis=1,
meta=empty)
prev = df2.loc[:, on].assign(month=df2['month'] - 1)\
.apply(lambda col: '_'.join(col.astype(str)), axis=1, meta=empty)
df2 = df2.set_index(current)
# We know have two dask.Series, `current` and `prev`, in the
# concatenated format "shop_month_day".
# We also have a dask.DataFrame, df2, which is indexed by `current`
# I would think we could just call df2.loc[prev].compute(), but
# that's throwing a KeyError for me, so slightly more expensive:
sale_prev_month = df2.compute().loc[prev.compute()][['sale']]\
.reset_index(drop=True)
# Now just concat as before
# Could re-break into dask objects here if you really needed to
df = pd.concat((df, sale_prev_month.reset_index(drop=True)), axis=1)
Upvotes: 2