Rishi S

Reputation: 385

How to merge rows in a Spark dataset to combine a string column

I need to merge two or more rows in a dataset into one, grouping by an id column. The column to be merged is a string, and the merged column should contain the values joined as a comma-separated string. How do I achieve this in Java? Input rows:

col1,col2  
1,abc  
2,pqr  
1,abc1  
3,xyz
2,pqr1

Expected output :

col1, col2  
1, "abc,abc1"  
2, "pqr,pqr1"  
3, xyz  

Upvotes: 2

Views: 4118

Answers (2)

Rahul Barman

Reputation: 1

Use groupBy with concat_ws and collect_list. In Java the functions come in via a static import:

import static org.apache.spark.sql.functions.*;

df.groupBy("col1").agg(concat_ws(",", collect_list("col2")));
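Running the Spark job needs a live session, but the semantics of groupBy plus concat_ws over collect_list can be sketched with plain Java streams on the question's sample data (a sketch only, not Spark itself; the class name is made up):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GroupConcatSketch {
    public static void main(String[] args) {
        // Sample rows from the question: (col1, col2)
        List<String[]> rows = Arrays.asList(
                new String[]{"1", "abc"},
                new String[]{"2", "pqr"},
                new String[]{"1", "abc1"},
                new String[]{"3", "xyz"},
                new String[]{"2", "pqr1"});

        // Equivalent of groupBy("col1") + concat_ws(",", collect_list("col2")):
        // group on the first field, join the second field with commas.
        Map<String, String> merged = rows.stream().collect(
                Collectors.groupingBy(r -> r[0], TreeMap::new,
                        Collectors.mapping(r -> r[1], Collectors.joining(","))));

        merged.forEach((k, v) -> System.out.println(k + "," + v));
        // 1,abc,abc1
        // 2,pqr,pqr1
        // 3,xyz
    }
}
```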

Upvotes: 0

Oli

Reputation: 10406

To aggregate two separate columns:

your_data_frame
    .withColumn("aggregated_column", concat_ws(",", col("col1"), col("col2")));

Just in case, here is what to import besides the usual stuff:

import static org.apache.spark.sql.functions.*;

Edit

If you want to aggregate an arbitrary number of columns that you know by name, you can do it this way:

import java.util.Arrays;
import java.util.stream.Collectors;
import org.apache.spark.sql.Column;

String[] column_names = {"c1", "c2", "c3"};
Column[] columns = Arrays.asList(column_names)
            .stream().map(x -> col(x))
            .collect(Collectors.toList())
            .toArray(new Column[0]);
data_frame
    .withColumn("agg", concat_ws(",", columns));
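The list-stream-collect-toArray chain above can also be written with Arrays.stream and an array constructor reference (with Spark that would be Arrays.stream(column_names).map(x -> col(x)).toArray(Column[]::new)). A self-contained illustration of the same idiom using only the standard library, with String in place of Column:

```java
import java.util.Arrays;

public class ToArrayIdiom {
    public static void main(String[] args) {
        String[] names = {"c1", "c2", "c3"};
        // Stream the array, map each element, collect straight into a typed array.
        String[] upper = Arrays.stream(names)
                .map(String::toUpperCase)
                .toArray(String[]::new);
        System.out.println(String.join(",", upper)); // C1,C2,C3
    }
}
```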

Edit #2: group by and concat

In case you want to group by a column "ID" and aggregate another column, you can do it this way:

dataframe
    .groupBy("ID")
    .agg(concat_ws(",", collect_list(col("col1"))));

Upvotes: 4
