raju

Reputation: 4978

How to merge rows using SQL only?

I can use neither PySpark nor Scala; I can only write SQL code. I have a table with two columns: item_id and name.

item_id  name
1        name1
1        name2
1        name3
2        name4
2        name5

I want to generate a result where the names for each item_id are concatenated:

item_id  names
1        name1-name2-name3
2        name4-name5

How do I create such a table with Spark SQL?
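For reference, the sample data can be set up like this (a minimal sketch; tableName is just an assumed view name):

CREATE OR REPLACE TEMPORARY VIEW tableName AS
SELECT * FROM VALUES
  (1, 'name1'), (1, 'name2'), (1, 'name3'),
  (2, 'name4'), (2, 'name5')
AS t(item_id, name);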

Upvotes: 2

Views: 157

Answers (3)

Jacek Laskowski

Reputation: 74619

The beauty of Spark SQL is that once you have a solution in any of the supported languages (Scala, Java, Python, R or SQL), you can usually derive the other variants from it.

The following SQL statement seems to do what you ask for:

SELECT item_id, array_join(collect_list(name), '-') as names 
FROM tableName
GROUP BY item_id

In spark-shell it gives the following result:

scala> sql("select item_id, array_join(collect_list(name), '-') as names from so group by item_id").show
+-------+-----------------+
|item_id|            names|
+-------+-----------------+
|      1|name1-name2-name3|
|      2|      name4-name5|
+-------+-----------------+
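One caveat: collect_list does not guarantee the order of the collected values, so after a shuffle the names may come out in any order. If you need a deterministic order, sort the array before joining it (a sketch using sort_array, which sorts ascending by default; array_join itself requires Spark 2.4 or later):

SELECT item_id, array_join(sort_array(collect_list(name)), '-') AS names
FROM tableName
GROUP BY item_id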

Upvotes: 2

david gupta

Reputation: 56

You can use the Spark DataFrame groupBy and agg methods together with the concat_ws function:

import org.apache.spark.sql.functions.{collect_list, concat_ws}
import spark.implicits._  // enables the $"column" syntax

df.groupBy($"item_id").agg(concat_ws("-", collect_list($"name")).alias("names")).show()

This groups the rows by item_id and aggregates the name values of each group by concatenating them with a dash.
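Since the question asks for SQL only, the same concat_ws approach also works as a plain SQL statement (a sketch, assuming the tableName view from the question):

SELECT item_id, concat_ws('-', collect_list(name)) AS names
FROM tableName
GROUP BY item_id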

Upvotes: 0

Fahmi

Reputation: 37473

You can try the below:

from pyspark.sql.functions import array_join, collect_list

(df
    .orderBy('name')
    .groupBy('item_id')
    .agg(
        array_join(
            collect_list('name'),
            delimiter='-',
        ).alias('names')
    ))

Upvotes: 1
