TheRealJimShady

Reputation: 4293

create substring column in spark dataframe

I want to take a JSON file and map it so that one of the columns is a substring of another. For example, to take the left table and produce the right table:

 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  hello  |

I can do this using Spark SQL syntax, but how can it be done using the built-in functions?
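
For reference, a minimal sketch of one way to do this with the SQL syntax (the file path and view name are just illustrative, assuming a spark-shell style session):

// read the JSON and expose it to SQL as a temp view
val df = spark.read.json("/path/to/input.json")
df.createOrReplaceTempView("t")

// take the first five characters of column "a" as a new column "b"
spark.sql("SELECT a, substring(a, 1, 5) AS b FROM t").show()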

Upvotes: 18

Views: 102136

Answers (6)

Saltanat Khalyk

Reputation: 79

Column.substr(startPos, length) uses 1-based positions. If you want a substring counted from the beginning of the string, e.g. the first 5 characters ('hello'), use:

from pyspark.sql.functions import col
df = df.withColumn('b', col('a').substr(1, 5))

If you instead want the last 5 characters of the string ('world' in this example), use a negative start position:

df = df.withColumn('b', col('a').substr(-5, 5))

Upvotes: 1

KeepLearning

Reputation: 525

You can just do it the PySpark way, as in the following example:

# Column.substr(startPos, length) uses 1-based positions
df.withColumn('New_col', df['Old_col'].substr(1, 7)).show()

Upvotes: 4

Ignacio Alorre

Reputation: 7605

Just to enrich the existing answers: in case you are interested in the right part of the string column, that is:

 ------------              ------------------------
|     a      |             |      a     |    b    |
|------------|       ->    |------------|---------|
|hello, world|             |hello, world|  world  |

You should use a negative index:

dataFrame.select(col("a"), substring_index(col("a"), ",", -1).as("b"))
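
Note that the expression above keeps the leading space after the comma (" world"). A minimal sketch of stripping it, simply wrapping the expression in trim:

import org.apache.spark.sql.functions.{col, substring_index, trim}

// substring_index(a, ",", -1) yields " world" here; trim removes the surrounding whitespace
dataFrame.select(col("a"), trim(substring_index(col("a"), ",", -1)).as("b"))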

Upvotes: 6

Balázs Fehér

Reputation: 344

Suppose you have the following dataframe:

import spark.implicits._
import org.apache.spark.sql.functions._

var df = sc.parallelize(Seq(("foobar", "foo"))).toDF("a", "b")

+------+---+
|     a|  b|
+------+---+
|foobar|foo|
+------+---+

You could create a new column as a substring of the first column as follows:

// substring(str, pos, len) uses a 1-based pos; len is a length, not an end index,
// and is truncated at the end of the string, so this yields "bar"
df = df.select(col("*"), substring(col("a"), 4, 6).as("c"))

+------+---+---+
|     a|  b|  c|
+------+---+---+
|foobar|foo|bar|
+------+---+---+

Upvotes: 12

pasha701

Reputation: 7207

A statement like the following can be used:

import org.apache.spark.sql.functions._

dataFrame.select(col("a"), substring_index(col("a"), ",", 1).as("b"))
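
For context, substring_index(str, delim, count) returns everything before the count-th occurrence of the delimiter, so this keeps the part before the first comma. A minimal, self-contained sketch (the one-row DataFrame is just illustrative, assuming a spark-shell style session):

import spark.implicits._
import org.apache.spark.sql.functions.{col, substring_index}

val dataFrame = Seq("hello, world").toDF("a")
dataFrame.select(col("a"), substring_index(col("a"), ",", 1).as("b")).show()

// expected output:
// +------------+-----+
// |           a|    b|
// +------------+-----+
// |hello, world|hello|
// +------------+-----+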

Upvotes: 27

soote

Reputation: 3260

You would use the withColumn function with a UDF:

import org.apache.spark.sql.functions.{udf, col}

// placeholder: replace with your actual substring logic (here: first five characters)
def substringFn(str: String): String = str.take(5)

val substring = udf(substringFn _)
dataframe.withColumn("b", substring(col("a")))

Upvotes: 6
