Chaouki

Reputation: 465

How to get statistics from a struct column in a Spark DataFrame?

I'm working with a dataframe:

    df.printSchema()
    root
     |-- key_value: struct (nullable = true)
     |    |-- key: string (nullable = true)
     |    |-- value: string (nullable = true)

    df.show(5)
    +---------+
    |key_value|
    +---------+
    |  [k1,v1]|
    |  [k1,v2]|
    |  [k2,v3]|
    |  [k3,v6]|
    |  [k4,v5]|
    +---------+

I want to get the number of distinct keys in my dataframe, so I tried to build a dataframe with separate key and value columns using explode, but it fails:

    val f = df.withColumn("k", explode(col("key_value")))
    org.apache.spark.sql.AnalysisException: cannot resolve 'explode(`key_value`)' due to data type mismatch: input to function explode should be array or map type, not StructType(StructField(key,StringType,true), StructField(value,StringType,true));;

Any help?

Upvotes: 0

Views: 60

Answers (1)

Mikel San Vicente

Reputation: 3863

You could do this:

    import spark.implicits._
    df.select($"key_value.key").distinct.count

The explode function applies to array or map fields; in this case neither key_value nor key is an array: key_value is a struct, which is why the AnalysisException above is raised. Struct fields are accessed with dot notation instead, as in the select above.
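For contrast, here is a minimal sketch of a case where explode does apply, using a hypothetical DataFrame with an array column named tags (the names and data are illustrative, not from the question):

    import org.apache.spark.sql.functions.{col, explode}
    import spark.implicits._

    // Hypothetical data: each row carries an array of tags.
    val tagged = Seq(
      ("a", Seq("x", "y")),
      ("b", Seq("y", "z"))
    ).toDF("id", "tags")

    // explode works here because tags is ArrayType, not StructType:
    // it emits one output row per array element.
    tagged.withColumn("tag", explode(col("tags"))).show()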

Upvotes: 1
