Reputation: 285
I have a dataframe like this:
Data ID
[1,2,3,4] 22
I want to create a new column where each value from the Data field is paired with the ID using the | symbol, and the pairs are joined with the ~ symbol, like below:
Data ID New_Column
[1,2,3,4] 22 [1|22~2|22~3|22~4|22]
Note: the array size in the Data field is not fixed; it may be empty or have N entries. Can anyone please help me solve this?
Upvotes: 1
Views: 1258
Reputation: 14895
A udf can help:
from pyspark.sql.types import StringType
from pyspark.sql import functions as F

def func(array, suffix):
    # pair each array element with the suffix, then join the pairs with ~
    return '~'.join([str(x) + '|' + str(suffix) for x in array])

my_udf = F.udf(func, StringType())

df.withColumn("New_Column", my_udf("Data", "ID")).show()
prints
+------------+---+-------------------+
|        Data| ID|         New_Column|
+------------+---+-------------------+
|[1, 2, 3, 4]| 22|1|22~2|22~3|22~4|22|
+------------+---+-------------------+
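Note that this UDF returns an empty string for an empty array, but it will raise a TypeError if Data is null. A minimal null-safe variant of the same function, assuming a null array should be treated like an empty one:

def func(array, suffix):
    # assumption: a null Data array should yield the same result as an empty one
    if array is None:
        return ''
    return '~'.join([str(x) + '|' + str(suffix) for x in array])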
Upvotes: 1
Reputation: 5536
Spark 2.4+
Spark 2.4+ has the built-in higher-order function transform, so the same can be done without a UDF. The PySpark version goes like this:
import pyspark.sql.functions as f

df = spark.createDataFrame([(22, [1,2,3,4]), (23, [1,2,3,4,5,6,7,8]), (24, [])], ['Id', 'Data'])
df.show()
+---+--------------------+
| Id|                Data|
+---+--------------------+
| 22|        [1, 2, 3, 4]|
| 23|[1, 2, 3, 4, 5, 6...|
| 24|                  []|
+---+--------------------+
df.withColumn('ff', f.when(f.size('Data') == 0, '').otherwise(
    f.expr("concat_ws('~', transform(Data, x -> concat(x, '|', Id)))")
)).show(20, False)
+---+------------------------+---------------------------------------+
|Id |Data                    |ff                                     |
+---+------------------------+---------------------------------------+
|22 |[1, 2, 3, 4]            |1|22~2|22~3|22~4|22                    |
|23 |[1, 2, 3, 4, 5, 6, 7, 8]|1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23|
|24 |[]                      |                                       |
+---+------------------------+---------------------------------------+
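On Spark 3.1+, transform is also exposed in the Python API, so the expr string is not needed. A sketch of the same expression, assuming Spark 3.1+ (the explicit casts are there because concat expects string arguments; concat_ws already yields '' for an empty array, so no when/otherwise guard is required):

import pyspark.sql.functions as f

df.withColumn(
    'ff',
    f.concat_ws('~', f.transform('Data', lambda x: f.concat(x.cast('string'), f.lit('|'), f.col('Id').cast('string'))))
).show(20, False)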
If you want the final output as an array, wrap the expression in f.array:
df.withColumn('ff', f.array(f.when(f.size('Data') == 0, '').otherwise(
    f.expr("concat_ws('~', transform(Data, x -> concat(x, '|', Id)))")
))).show(20, False)
+---+------------------------+-----------------------------------------+
|Id |Data                    |ff                                       |
+---+------------------------+-----------------------------------------+
|22 |[1, 2, 3, 4]            |[1|22~2|22~3|22~4|22]                    |
|23 |[1, 2, 3, 4, 5, 6, 7, 8]|[1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23]|
|24 |[]                      |[]                                       |
+---+------------------------+-----------------------------------------+
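As a side note: if you would rather keep one array element per pair instead of a single ~-joined string wrapped in an array, transform alone is enough on Spark 2.4+ (a sketch on the same df):

df.withColumn('ff', f.expr("transform(Data, x -> concat(x, '|', Id))")).show(20, False)

This produces [1|22, 2|22, 3|22, 4|22] for the first row and an empty array for the empty one.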
Hope this helps
Upvotes: 2
Reputation: 1214
The same result in Scala, with a UDF that also adds the surrounding brackets:
package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DF extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val df = Seq(
    (22, Seq(1, 2, 3, 4)),
    (23, Seq(1, 2, 3, 4, 5, 6, 7, 8)),
    (24, Seq[Int]())
  ).toDF("ID", "Data")

  // pair each element with the ID, join the pairs with ~, and wrap the result in brackets
  val arrUDF = udf((id: Long, array: Seq[Long]) => {
    val r = array.size match {
      case 0 => ""
      case _ => array.map(x => s"$x|$id").mkString("~")
    }
    s"[$r]"
  })

  val resDF = df.withColumn("New_Column", arrUDF('ID, 'Data))
  resDF.show(false)

  //+---+------------------------+-----------------------------------------+
  //|ID |Data                    |New_Column                               |
  //+---+------------------------+-----------------------------------------+
  //|22 |[1, 2, 3, 4]            |[1|22~2|22~3|22~4|22]                    |
  //|23 |[1, 2, 3, 4, 5, 6, 7, 8]|[1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23]|
  //|24 |[]                      |[]                                       |
  //+---+------------------------+-----------------------------------------+
}
Upvotes: 2