Reputation: 18042
I have a DataFrame
with the following simple schema
:
root
|-- amount: double (nullable = true)
|-- Date: timestamp (nullable = true)
I was trying to see the sum
of amounts per day and per hour, some like:
+---+--------+--------+ ... +--------+
|day| 0| 1| | 23|
+---+--------+--------+ ... +--------+
|148| 306.0| 106.0| | 0.0|
|243| 1906.0| 50.0| | 1.0|
| 31| 866.0| 100.0| | 0.0|
+---+--------+--------+ ... +--------+
Well, first I added a column hour
and then I grouped by day, and pivoted by hour. However, I got an exception, which perhaps is related to missing sales for some hours. This is what I'm trying to fix but I haven't realized how.
(df.withColumn("hour", hour("date"))
.groupBy(dayofyear("date").alias("day"))
.pivot("hour")
.sum("amount").show())
An excerpt of the exception.
AnalysisException: u'resolved attribute(s) date#3972 missing from day#5367,hour#5354,sum(amount)#5437 in operator !Aggregate [dayofyear(cast(date#3972 as date))], [dayofyear(cast(date#3972 as date)) AS day#5367, pivotfirst(hour#5354, sum(amount)#5437, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 0, 0) AS __pivot_sum(amount) AS sum(amount)#5487];'
Upvotes: 1
Views: 213
Reputation: 330363
The problem is unresolved day
column. You can create it outside groupBy
clause to address that:
df = (sc
.parallelize([
(1.0, "2016-03-30 01:00:00"), (30.2, "2015-01-02 03:00:02")])
.toDF(["amount", "Date"])
.withColumn("Date", col("Date").cast("timestamp"))
.withColumn("hour", hour("date")))
with_day = df.withColumn("day", dayofyear("Date"))
with_day.groupBy("day").pivot("hour", range(0, 24)).sum("amount")
values
argument for pivot
is optional but advisable.
Upvotes: 2