Sekar Ramu
Sekar Ramu

Reputation: 483

Spark dataframe missing columns when data are not available while using pivot function

I have source table with structure

source Type
a       aa  
a       aa 
a       bb 
a       cc

While using pivot to the source dataframe getting the resultant as

source aa bb cc
a       2  1  1

when the source table is empty then the resultant dataframe is just having the column 1 and pivot columns are not coming.

source

what I expect is something like below

source aa bb cc
0       0 0  0

Solutions tried:

val df = spark.sql(s"""
 select
 source,
 type
 from sourceTable
""")

val lscol = Seq(("aa"),("bb"),("cc")).toDF("valtype") df.join(lscol,df.col(“Type”)===lscol.col(“valtype”),”right”).drop(df(“Type”)).groupBy(“ID”).pivot(“Type”).agg(count(“Type”)).filter(!col(“source”).isNull).na.fill(0).createOrReplaceTempView("dummy_view")

which gave me a dummy_view with 2 columns as below

source null

Type : aa , bb, cc is standard. so if the value is not there in the table i want it to come as 0 instead of missing it.

Upvotes: 1

Views: 1649

Answers (2)

Aravind Palani
Aravind Palani

Reputation: 154

The pivot operation generates columns('aa','bb','cc') at the runtime based on the source data. As per you scenario if these are the 3 types ('aa','bb','cc') expected then the below hack with help you to proceed,

val result = source.groupBy("source").pivot("type").agg(count("type"))

val expectedDf = if (result.take(1).isEmpty) spark.range(1).selectExpr("0 as Source","0 as aa","0 as bb","0 as cc") else result

Upvotes: 0

Raphael Roth
Raphael Roth

Reputation: 27373

You can do it like this :

// create empty dataframe
val df = Seq(("dummy","dummy")).toDF("source","type").where(lit(false))

val lscol = Seq("aa",("bb"),("cc")).toDF("type")

df.join(lscol,Seq("type"),"right")
  .groupBy("source").pivot("Type").agg(sum(when(col("source").isNull,lit(0)).otherwise(lit(1))))
  .na.fill("0",Seq("source"))
  .show()

gives:

+------+---+---+---+
|source| aa| bb| cc|
+------+---+---+---+
|     0|  0|  0|  0|
+------+---+---+---+

But I would rather do it like

val newDF = if(df.isEmpty) {
  Seq(("0",0,0,0)).toDF("source","aa","bb","cc")
} else {
  // your ususal code
  throw new NotImplementedError()
}

newDF.show()

gives:

+------+---+---+---+
|source| aa| bb| cc|
+------+---+---+---+
|     0|  0|  0|  0|
+------+---+---+---+

Upvotes: 1

Related Questions