abc_spark
abc_spark

Reputation: 383

column split in Spark Scala dataframe

I have the below Data frame with me -

scala> val df1=Seq(
     | ("1_10","2_20","3_30"),
     | ("7_70","8_80","9_90")
     | )toDF("c1","c2","c3")

scala> df1.show

+----+----+----+
|  c1|  c2|  c3|
+----+----+----+
|1_10|2_20|3_30|
|7_70|8_80|9_90|
+----+----+----+

How to split this to different columns based on delimiter "_".

Expected output -

+----+----+----+----+----+----+
|  c1|  c2|  c3|c1_1|c2_1|c3_1|
+----+----+----+----+----+----+
|1   |2   |3   |  10|  20|  30|
|7   |8   |9   |  70|  80|  90|
+----+----+----+----+----+----+

Also I have 50 + columns in the DF. Thanks in Advance.

Upvotes: 1

Views: 1609

Answers (5)

s.polam
s.polam

Reputation: 10372

Try to use select instead of foldLeft for better performance. As foldLeft might be taking longer time than select

Check this post - foldLeft,select

val expr = df
.columns
.flatMap(c => Seq(
        split(col(c),"_")(0).as(s"${c}_1"),
        split(col(c),"_")(1).as(s"${c}_2")
    )
)
.toSeq

Result

df.select(expr:_*).show(false)
    
+----+----+----+----+----+----+
|c1_1|c1_2|c2_1|c2_2|c3_1|c3_2|
+----+----+----+----+----+----+
|1   |10  |2   |20  |3   |30  |
|7   |70  |8   |80  |9   |90  |
+----+----+----+----+----+----+

Upvotes: 1

Raghu
Raghu

Reputation: 1712

pyspark solution:

import pyspark.sql.functions as F
df1=sqlContext.createDataFrame([("1_10","2_20","3_30"),("7_70","8_80","9_90")]).toDF("c1","c2","c3")
expr = [F.split(coln,"_") for coln in df1.columns]
df2=df1.select(*expr)
#%%
df3= df2.withColumn("clctn",F.flatten(F.array(df2.columns)))
#%%  assuming all columns will have data in the same format x_y
arr_size = len(df1.columns)*2
df_fin= df3.select([F.expr("clctn["+str(x)+"]").alias("c"+str(x/2)+'_'+str(x%2)) for x in range(arr_size)])

Results:

+----+----+----+----+----+----+
|c0_0|c0_1|c1_0|c1_1|c2_0|c2_1|
+----+----+----+----+----+----+
|   1|  10|   2|  20|   3|  30|
|   7|  70|   8|  80|   9|  90|
+----+----+----+----+----+----+

Upvotes: 1

Manoj Kumar Dhakad
Manoj Kumar Dhakad

Reputation: 1892

You can do like this.

var df=Seq(("1_10","2_20","3_30"),("7_70","8_80","9_90")).toDF("c1","c2","c3")

  for (cl <- df.columns) {
    df=df.withColumn(cl+"_temp",split(df.col(cl),"_")(0))
    df=df.withColumn(cl+"_"+cl.substring(1),split(df.col(cl),"_")(1))
    df=df.withColumn(cl,df.col(cl+"_temp")).drop(cl+"_temp")
  }
  df.show(false)
}

//Sample output
    +---+---+---+----+----+----+
    |c1 |c2 |c3 |c1_1|c2_2|c3_3|
    +---+---+---+----+----+----+
    |1  |2  |3  |10  |20  |30  |
    |7  |8  |9  |70  |80  |90  |
    +---+---+---+----+----+----+

Upvotes: 0

koiralo
koiralo

Reputation: 23109

Here is the good use of foldLeft. Split each column and create a new column for each splited value

val cols = df1.columns
  cols.foldLeft(df1) { (acc, name) =>
    acc.withColumn(name, split(col(name), "_"))
      .withColumn(s"${name}_1", col(name).getItem(0))
      .withColumn(s"${name}_2", col(name).getItem(1))
  }.drop(cols:_*)
   .show(false)

If you need the columns name exactly as you want then you need to filter the columns that ends with _1 and rename them again with foldLeft

Output:

+----+----+----+----+----+----+
|c1_1|c1_2|c2_1|c2_2|c3_1|c3_2|
+----+----+----+----+----+----+
|1   |10  |2   |20  |3   |30  |
|7   |70  |8   |80  |9   |90  |
+----+----+----+----+----+----+

Upvotes: 2

milos
milos

Reputation: 271

You can use split method

split(col("c1"), '_')

This will return you ArrayType(StringType) Then you can access items with .getItem(index) method. That is if you have a stable number of elements after spliting if that isnt the case you will have some null values if the indexed value isnt present in the array after splitting.

Example of code:

df.select(
  split(col("c1"), "_").alias("c1_items"),
  split(col("c2"), "_").alias("c2_items"),
  split(col("c3"), "_").alias("c3_items"),
).select(
  col("c1_items").getItem(0).alias("c1"),
  col("c1_items").getItem(1).alias("c1_1"),
  col("c2_items").getItem(0).alias("c2"),
  col("c2_items").getItem(1).alias("c2_1"),
  col("c3_items").getItem(0).alias("c3"),
  col("c3_items").getItem(1).alias("c3_1")
)

Since you need to do this for 50+ columns I would probably suggest to wrap this in a method for a single column + withColumn statement in this kind of way

def splitMyCol(df: Dataset[_], name: String) = {
  df.withColumn(
    s"${name}_items", split(col("name"), "_")
  ).withColumn(
    name, col(s"${name}_items).getItem(0)
  ).withColumn(
    s"${name}_1", col(s"${name}_items).getItem(1)
  ).drop(s"${name}_items")
}

Note I assume you do not need items to be maintained thus I drop it. Also not that due to _ in the name between two variable is s"" string you need to wrap first one in {}, while second doesnt really need {} wrapping and $ is enough.

You can wrap this then in a fold method in this way:

val result = columnsToExpand.foldLeft(df)(
  (acc, next) => splitMyCol(acc, next)
)

Upvotes: 1

Related Questions