Mithril

Reputation: 13718

How to aggregate by day with multiple columns [PySpark]?

I want to convert the pandas code below to PySpark.

import pandas as pd

d = {'has_discount': 'count',
    'clearance': 'count',
    'count': ['count', 'sum'],
    'price_guide': 'max'}

# time_create is an epoch-millisecond timestamp; make it the index
# so the frame can be resampled by day.
df.index = pd.to_datetime(df.time_create, unit='ms')
df1 = df.resample('D').agg(d)

# Flatten the resulting MultiIndex columns, e.g. ('count', 'sum') -> 'count_sum'
df1.columns = df1.columns.map('_'.join)
d1 = {'has_discount_count': 'discount_order_count',
    'clearance_count': 'clearance_order_count',
    'count_count': 'order_count',
    'count_sum': 'sale_count',
    'price_guide_max': 'price_guide'}

df2 = df1.rename(columns=d1)

However, there is no resample in PySpark, so I tried to use groupby instead:

from pyspark.sql.functions import date_format, from_unixtime

d = {'has_discount': 'count',
    'clearance': 'count',
    'count': ['count', 'sum'],
    'price_guide': 'max'}

df.select(date_format(from_unixtime(df.time_create / 1000),
                      'yyyy-MM-dd').alias('day')) \
    .groupby('day').agg(d).show(5)

But I got this error:

AnalysisException: u'Cannot resolve column name "price_guide" among (day);'

PySpark's aggregation doesn't seem to support an input like d: as far as I can tell, the dict form of agg only takes a single function name per column, so the ['count', 'sum'] entry can't be expressed that way. For example, this dict form is accepted (a toy sketch with made-up data, just to illustrate):
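from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up rows standing in for df, just to show the accepted dict form.
toy = spark.createDataFrame(
    [('2017-01-01', 1, 0, 2, 9.9), ('2017-01-01', 0, 1, 3, 19.9)],
    ['day', 'has_discount', 'clearance', 'count', 'price_guide'])

# One aggregate function name per column works...
toy.groupby('day').agg({'has_discount': 'count',
                        'clearance': 'count',
                        'count': 'count',
                        'price_guide': 'max'}).show()

# ...but a list value like {'count': ['count', 'sum']} is rejected,
# so 'sum' of 'count' has no place in this form.

What should I do?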

Upvotes: 2

Views: 696

Answers (1)

vvg

Reputation: 6385

The df.select you're using leaves you with only the single column day, but your aggregation statement refers to other columns. What you probably want is to add a day column to the columns that already exist:

df = df.withColumn('day', date_format(from_unixtime(df.time_create / 1000),
                                      'yyyy-MM-dd'))
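For reference, here is a sketch of the whole pipeline with that fix. Since the dict form can't apply two functions to count, I've written the aggregates as explicit Column expressions instead (one option among several), with aliases taken from the rename dict d1 in your question:

from pyspark.sql import functions as F

# Keep all original columns, add 'day' from the epoch-millisecond
# timestamp, then aggregate per day with explicit expressions.
df2 = (df
       .withColumn('day', F.date_format(F.from_unixtime(df.time_create / 1000),
                                        'yyyy-MM-dd'))
       .groupby('day')
       .agg(F.count('has_discount').alias('discount_order_count'),
            F.count('clearance').alias('clearance_order_count'),
            F.count('count').alias('order_count'),
            F.sum('count').alias('sale_count'),
            F.max('price_guide').alias('price_guide')))
df2.show(5)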

Upvotes: 1
