Mithril

Reputation: 13718

How to aggregate by day with multiple columns [PySpark]?

I want to convert the pandas code below to PySpark.

import pandas as pd

d = {'has_discount': 'count',
    'clearance': 'count',
    'count': ['count', 'sum'],
    'price_guide': 'max'}

# time_create is an epoch-millisecond timestamp; make it the index
# so the frame can be resampled by day.
df.index = pd.to_datetime(df.time_create, unit='ms')
df1 = df.resample('D').agg(d)

# Flatten the resulting MultiIndex columns, e.g. ('count', 'sum') -> 'count_sum'
df1.columns = df1.columns.map('_'.join)
d1 = {'has_discount_count': 'discount_order_count',
    'clearance_count': 'clearance_order_count',
    'count_count': 'order_count',
    'count_sum': 'sale_count',
    'price_guide_max': 'price_guide'}

df2 = df1.rename(columns=d1)

However, there is no resample in PySpark, so I tried to use groupby instead:

from pyspark.sql.functions import date_format, from_unixtime

d = {'has_discount': 'count',
    'clearance': 'count',
    'count': ['count', 'sum'],
    'price_guide': 'max'}

df.select(date_format(from_unixtime(df.time_create / 1000),
                      'yyyy-MM-dd').alias('day')) \
    .groupby('day').agg(d).show(5)

But I got this error:

AnalysisException: u'Cannot resolve column name "price_guide" among (day);'

PySpark's aggregation doesn't seem to support an input like d: as far as I can tell, the dict form of agg only takes a single function name per column, so the ['count', 'sum'] entry can't be expressed that way. For example, this dict form is accepted (a toy sketch with made-up data, just to illustrate):
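from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up rows standing in for df, just to show the accepted dict form.
toy = spark.createDataFrame(
    [('2017-01-01', 1, 0, 2, 9.9), ('2017-01-01', 0, 1, 3, 19.9)],
    ['day', 'has_discount', 'clearance', 'count', 'price_guide'])

# One aggregate function name per column works...
toy.groupby('day').agg({'has_discount': 'count',
                        'clearance': 'count',
                        'count': 'count',
                        'price_guide': 'max'}).show()

# ...but a list value like {'count': ['count', 'sum']} is rejected,
# so 'sum' of 'count' has no place in this form.

What should I do?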

Upvotes: 2

Views: 696

Answers (1)

vvg

Reputation: 6385

The df.select you're using leaves you with only the single column day, but your aggregation statement refers to other columns. What you probably want is to add a day column to the columns that already exist:

df = df.withColumn('day', date_format(from_unixtime(df.time_create / 1000),
                                      'yyyy-MM-dd'))
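For reference, here is a sketch of the whole pipeline with that fix. Since the dict form can't apply two functions to count, I've written the aggregates as explicit Column expressions instead (one option among several), with aliases taken from the rename dict d1 in your question:

from pyspark.sql import functions as F

# Keep all original columns, add 'day' from the epoch-millisecond
# timestamp, then aggregate per day with explicit expressions.
df2 = (df
       .withColumn('day', F.date_format(F.from_unixtime(df.time_create / 1000),
                                        'yyyy-MM-dd'))
       .groupby('day')
       .agg(F.count('has_discount').alias('discount_order_count'),
            F.count('clearance').alias('clearance_order_count'),
            F.count('count').alias('order_count'),
            F.sum('count').alias('sale_count'),
            F.max('price_guide').alias('price_guide')))
df2.show(5)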

Upvotes: 1
