Dimas Rizky

Reputation: 374

Spark DataFrame aggregate multiple column into one column as a string

I want to convert a Spark DataFrame into another DataFrame with a specific manner as follows:

I have Spark DataFrame:

+---------+------------+
|protocol |   count    |
+---------+------------+
|      TCP|    8231    |
|     ICMP|    7314    |
|      UDP|    5523    |
|     IGMP|    4423    |
|      EGP|    2331    |
+---------+------------+

And I want to turn it into:

+----------------------------------------------------------+
|Aggregated                                                |
+----------------------------------------------------------+
|{tcp: 8231, icmp: 7314, udp: 5523, igmp: 4423, egp: 2331} |
+----------------------------------------------------------+

The aggregated column can be a list, a map, or a string. Is this possible with DataFrame functions, or do I need to write my own UDF to aggregate this?

Upvotes: 0

Views: 1723

Answers (3)

Mann

Reputation: 307

Concat the columns in the DataFrame and create a new column:

import org.apache.spark.sql.functions.{concat, lit, collect_list}

val concat_df = df.withColumn("concat", concat($"protocol", lit(" : "), $"count"))

To aggregate it into a single row as a list, you can do this:

val agg_df = concat_df.groupBy().agg(collect_list("concat").as("aggregated"))
agg_df.show

If you want to get the data as a string instead of a DataFrame, you can collect it as follows:

concat_df.select("concat").collect.map(x => x.get(0)).mkString("{", ",", "}")

Upvotes: 0

Alper t. Turker

Reputation: 35219

pivot and toJSON will give you what you need:

import org.apache.spark.sql.functions.first

df.groupBy().pivot("protocol").agg(first("count")).toJSON.show(false)
// +----------------------------------------------------------+                    
// |value                                                     |
// +----------------------------------------------------------+
// |{"EGP":2331,"ICMP":7314,"IGMP":4423,"TCP":8231,"UDP":5523}|
// +----------------------------------------------------------+

Upvotes: 2

Shaido

Reputation: 28322

Since you want to combine all columns into a single one, and the DataFrame does not seem to have many rows to begin with, you can collect it to the driver and use pure Scala code to convert it into the format you want.

The following will give you an Array[String]:

val res = df.as[(String, Int)].collect.map { case (protocol, count) => protocol + ": " + count }

To convert it into a single string, simply do:

val str = res.mkString("{", ", ", "}")
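For reference, the driver-side formatting step above can be sketched in plain Scala without a Spark session; the sample tuples below simply reproduce the question's table to show what collect would hand back:

// (protocol, count) tuples matching the question's DataFrame.
val rows = Array(("TCP", 8231), ("ICMP", 7314), ("UDP", 5523), ("IGMP", 4423), ("EGP", 2331))

// Same mapping as the collect.map step above.
val res = rows.map { case (protocol, count) => protocol + ": " + count }

// Join with "{" prefix, ", " separator, "}" suffix.
val str = res.mkString("{", ", ", "}")
// str == "{TCP: 8231, ICMP: 7314, UDP: 5523, IGMP: 4423, EGP: 2331}"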

Upvotes: 0
