Spark SQL Map only one column of DataFrame

Question

Sorry for the noob question, I have a dataframe in SparkSQL like this:

id | name | data
----------------
1  | Mary | ABCD
2  | Joey | DOGE
3  | Lane | POOP
4  | Jack | MEGA
5  | Lynn | ARGH

I want to know how to do two things:

1) use a scala function on one or more columns to produce another column 2) use a scala function on one or more columns to replace a column

Examples:

1) Create a new boolean column that tells whether the data starts with A:

id | name | data | startsWithA
------------------------------
1  | Mary | ABCD |        true
2  | Joey | DOGE |       false
3  | Lane | POOP |       false
4  | Jack | MEGA |       false
5  | Lynn | ARGH |        true

2) Replace the data column with its lowercase counterpart:

id | name | data
----------------
1  | Mary | abcd
2  | Joey | doge
3  | Lane | poop
4  | Jack | mega
5  | Lynn | argh

What is the best way to do this in SparkSQL? I've seen many examples of how to return a single transformed column, but I don't know how to get back a new DataFrame with all the original columns as well.

koiralo · Accepted Answer

You can use withColumn to add new column or to replace the existing column as

val df = Seq(
 (1, "Mary", "ABCD"),
 (2, "Joey", "DOGE"),
 (3, "Lane", "POOP"),
 (4, "Jack", "MEGA"),
 (5, "Lynn", "ARGH")
).toDF("id", "name", "data")


val resultDF = df.withColumn("startsWithA", $"data".startsWith("A"))
  .withColumn("data", lower($"data"))

If you want separate dataframe then

val resultDF1 = df.withColumn("startsWithA", $"data".startsWith("A"))
val resultDF2 = df.withColumn("data", lower($"data"))

withColumn replaces the old column if the same column name is provided and creates a new column if new column name is provided. Output:

+---+----+----+-----------+
|id |name|data|startsWithA|
+---+----+----+-----------+
|1  |Mary|abcd|true       |
|2  |Joey|doge|false      |
|3  |Lane|poop|false      |
|4  |Jack|mega|false      |
|5  |Lynn|argh|true       |
+---+----+----+-----------+

Spark SQL Map only one column of DataFrame

Answers (1)

Related Questions