Reputation: 4293
I want to take a json file and map it so that one of the columns is a substring of another. For example to take the left table and produce the right table:
------------ ------------------------
| a | | a | b |
|------------| -> |------------|---------|
|hello, world| |hello, world| hello |
I can do this using spark-sql syntax but how can it be done using the in-built functions?
Upvotes: 18
Views: 102136
Reputation: 79
if you want to get substring from the beginning of string then count their index from 0, where letter 'h' has 7th and letter 'o' has 11th index:
from pyspark.sql.functions import substring
df = df.withColumn('b', col('a').substr(7, 11))
if you want to get last 5 strings and word 'hello' with length equal to 5 in a column, then use:
df = df.withColumn('b', col('a').substr(-5,5))
Upvotes: 1
Reputation: 525
You can just do it by using the pyspark
way, like in the following example:
df.withColumn('New_col', df['Old_col'].substr(0, 7)).show()
Upvotes: 4
Reputation: 7605
Just to enrich existing answers. In case you were interested in the right part of the string column. That is:
------------ ------------------------
| a | | a | b |
|------------| -> |------------|---------|
|hello, world| |hello, world| world |
You should use a negative index:
dataFrame.select(col("a"), substring_index(col("a"), ",", -1).as("b"))
Upvotes: 6
Reputation: 344
Suppose you have the following dataframe:
import spark.implicits._
import org.apache.spark.sql.functions._
var df = sc.parallelize(Seq(("foobar", "foo"))).toDF("a", "b")
+------+---+
| a| b|
+------+---+
|foobar|foo|
+------+---+
You could subset a new column from the first column as follows:
df = df.select(col("*"), substring(col("a"), 4, 6).as("c"))
+------+---+---+
| a| b| c|
+------+---+---+
|foobar|foo|bar|
+------+---+---+
Upvotes: 12
Reputation: 7207
Such statement can be used
import org.apache.spark.sql.functions._
dataFrame.select(col("a"), substring_index(col("a"), ",", 1).as("b"))
Upvotes: 27
Reputation: 3260
You would use the withColumn
function
import org.apache.spark.sql.functions.{ udf, col }
def substringFn(str: String) = your substring code
val substring = udf(substringFn _)
dataframe.withColumn("b", substring(col("a"))
Upvotes: 6