learner

Reputation: 155

Flatten data in a spark sql query - Spark Dataframe

I'm trying to fetch the unique values of a column in a table and print them alongside other columns such as the sum and the table name, as shown in the query below. The subquery returns multiple rows, which is valid, but to fit the result into a single column I need to flatten it into a list or text, e.g. [1,2,1]. How can I flatten this data so it fits in the same query shown below? My preferred approach is Spark SQL. Please help.

    **Subquery**

    (select distinct $field from tablename) 

    **Actual Code** 

    RDDdf.createOrReplaceTempView("tablename")

    val x=  RDDdf.schema.fieldNames

    val dfs = x.map(field => spark.sql(
      s"select 'RDDdf' as TableName, '$field' as column, sum($field) as Sum, " +
      s"(select distinct $field from tablename) from tablename"))

    **Expected output**

    TableName | column  | Sum | UniqueVal
    ----------+---------+-----+----------
    RDDdf     | a       | 4   | 1,2,1

Upvotes: 2

Views: 726

Answers (1)

T. Gawęda

Reputation: 16076

Use built-in function collect_list:

select collect_list(value) from (select distinct($field) as value from tablename) as values

Be aware that this will be VERY slow on large datasets.
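
A minimal sketch of how this could be wired back into the question's loop (untested; it assumes an existing `SparkSession` named `spark`, the temp view `tablename` from the question, and that every column is summable — the `UniqueVal` alias is illustrative):

    // For each column, sum it and collect its distinct values into one array,
    // using an uncorrelated scalar subquery around collect_list.
    val dfs = RDDdf.schema.fieldNames.map { field =>
      spark.sql(
        s"""select 'RDDdf' as TableName,
           |       '$field' as column,
           |       sum($field) as Sum,
           |       (select collect_list(value)
           |          from (select distinct $field as value from tablename)) as UniqueVal
           |from tablename""".stripMargin)
    }

The inner `select distinct` deduplicates first, then `collect_list` folds the remaining rows into a single array value, so the outer query gets exactly one row back from the subquery.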

Upvotes: 2

Related Questions