Reputation: 1815
I have a dataframe as follows:
+----------+--------------------+
|CustomerNo| desc |
+----------+--------------------+
| 351856.0| FORM & BEAUTY 075 P|
| 351856.0| FORM & BEAUTY 075 P|
| 326022.0| D 151 HP|
| 69430.0|Shape Sensation 0...|
| 38018.0| Maximizer 846 WHU|
| 69712.0|Shape Sensation 0...|
| 71228.0| Aqua Festive WHUD|
| 71228.0|Maximizer 736 WHU...|
| 73200.0| T-Shirt Bra 081 HP|
| 73200.0| T-Shirt Bra 081 HP|
| 73200.0| T-Shirt Bra 081 HP|
| 74540.0|Form & Beauty 052 HP|
| 74578.0| G 56 WP 01|
| 74578.0| G 56 WP 01|
| 74578.0| G 56 WP 01|
| 74578.0| G 56 WP 01|
| 74578.0| G 56 WP 01|
| 74578.0| G 56 WP 01|
| 74578.0| G 56 WP 01|
| 74578.0| G 56 WP 01|
+----------+--------------------+
I need to group the data by CustomerNo
and concatenate the string column desc.
I am using the code below, but it throws an error:
df = retail_df.groupBy('CustomerNo').agg(F.concat('desc').alias('concat_Desc'))
Can anyone tell me how to do this?
Upvotes: 1
Views: 562
Reputation: 21
import pyspark.sql.functions as F
df = spark.read.option("inferSchema","true").option("header","true").csv("/FileStore/tables/test.csv")
print("Sample Data")
df.select("eid","ename").show()
print("")
print("")
print("Final Output")
df.select("eid","ename").groupBy("eid").agg(F.concat_ws(", ", F.collect_list("ename")).alias("desc")).show()
Sample Data
+---+-----+
|eid|ename|
+---+-----+
| 1| a|
| 1| b|
| 2| c|
| 2| d|
| 3| e|
| 4| f|
| 4| g|
+---+-----+
Final Output
+---+----+
|eid|desc|
+---+----+
| 1|a, b|
| 3| e|
| 4|f, g|
| 2|c, d|
+---+----+
Upvotes: 0
Reputation: 11917
You can groupBy the dataframe on CustomerNo and collect the desc values into a list with collect_list. You can then concatenate the items of that list column into a single string using concat_ws.
See the code below:
import pyspark.sql.functions as F

retail_df \
.groupBy('CustomerNo') \
.agg(F.collect_list('desc').alias('items')) \
.withColumn("concat_Desc", F.concat_ws(",", "items"))
This solution does not use a UDF, so it will perform better.
Upvotes: 1
Reputation: 9247
It is not very clear what your desired output is, but if I understood correctly, you would like something like the following solution, which uses collect_list
to group all the items into an array, and a udf
to join the elements of that array into a string:
import pyspark.sql.functions as F
@F.udf('string')
def concat_into_string(l):
return ' - '.join(l)
df = retail_df \
.groupBy('CustomerNo').agg(F.collect_list('desc').alias('desc')) \
.withColumn('final_string', concat_into_string('desc'))
df = spark.createDataFrame([
(1, 'A'),
(1, 'A'),
(2, 'B'),
(3, 'C1'),
(3, 'C2'),
(4, 'D'),
(4, 'D'),
(4, 'D'),
(4, 'D')
], ('CustomerNo', 'desc'))
df \
.groupBy('CustomerNo').agg(F.collect_list('desc').alias('desc')) \
.withColumn('final_string', concat_into_string('desc')) \
.show()
+----------+------------+-------------+
|CustomerNo| desc| final_string|
+----------+------------+-------------+
| 1| [A, A]| A - A|
| 2| [B]| B|
| 3| [C1, C2]| C1 - C2|
| 4|[D, D, D, D]|D - D - D - D|
+----------+------------+-------------+
Upvotes: 0