Alberto Bonsanto
Alberto Bonsanto

Reputation: 18042

Pivoting with missing values

I have a DataFrame with the following simple schema:

root
 |-- amount: double (nullable = true)
 |-- Date: timestamp (nullable = true)

I was trying to see the sum of amounts per day and per hour, some like:

+---+--------+--------+ ... +--------+
|day|       0|       1|     |      23|
+---+--------+--------+ ... +--------+
|148|   306.0|   106.0|     |     0.0|
|243|  1906.0|    50.0|     |     1.0|
| 31|   866.0|   100.0|     |     0.0|
+---+--------+--------+ ... +--------+

Well, first I added a column hour and then I grouped by day, and pivoted by hour. However, I got an exception, which perhaps is related to missing sales for some hours. This is what I'm trying to fix but I haven't realized how.

(df.withColumn("hour", hour("date"))
    .groupBy(dayofyear("date").alias("day"))
    .pivot("hour")
    .sum("amount").show())

An excerpt of the exception.

AnalysisException: u'resolved attribute(s) date#3972 missing from day#5367,hour#5354,sum(amount)#5437 in operator !Aggregate [dayofyear(cast(date#3972 as date))], [dayofyear(cast(date#3972 as date)) AS day#5367, pivotfirst(hour#5354, sum(amount)#5437, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 0, 0) AS __pivot_sum(amount) AS sum(amount)#5487];'

Upvotes: 1

Views: 213

Answers (1)

zero323
zero323

Reputation: 330363

The problem is unresolved day column. You can create it outside groupBy clause to address that:

df = (sc
         .parallelize([
             (1.0, "2016-03-30 01:00:00"), (30.2, "2015-01-02 03:00:02")])
        .toDF(["amount", "Date"])
        .withColumn("Date", col("Date").cast("timestamp"))
        .withColumn("hour", hour("date")))

with_day = df.withColumn("day", dayofyear("Date"))
with_day.groupBy("day").pivot("hour", range(0, 24)).sum("amount")

values argument for pivot is optional but advisable.

Upvotes: 2

Related Questions