Reputation: 6606
Sorry for the noob question, I have a dataframe in SparkSQL like this:
id | name | data
----------------
1 | Mary | ABCD
2 | Joey | DOGE
3 | Lane | POOP
4 | Jack | MEGA
5 | Lynn | ARGH
I want to know how to do two things:
1) use a scala function on one or more columns to produce another column 2) use a scala function on one or more columns to replace a column
Examples:
1) Create a new boolean column that tells whether the data starts with A:
id | name | data | startsWithA
------------------------------
1 | Mary | ABCD | true
2 | Joey | DOGE | false
3 | Lane | POOP | false
4 | Jack | MEGA | false
5 | Lynn | ARGH | true
2) Replace the data column with its lowercase counterpart:
id | name | data
----------------
1 | Mary | abcd
2 | Joey | doge
3 | Lane | poop
4 | Jack | mega
5 | Lynn | argh
What is the best way to do this in SparkSQL? I've seen many examples of how to return a single transformed column, but I don't know how to get back a new DataFrame with all the original columns as well.
Upvotes: 4
Views: 3781
Reputation: 23119
You can use withColumn
to add new column
or to replace the existing column
as
val df = Seq(
(1, "Mary", "ABCD"),
(2, "Joey", "DOGE"),
(3, "Lane", "POOP"),
(4, "Jack", "MEGA"),
(5, "Lynn", "ARGH")
).toDF("id", "name", "data")
val resultDF = df.withColumn("startsWithA", $"data".startsWith("A"))
.withColumn("data", lower($"data"))
If you want separate dataframe then
val resultDF1 = df.withColumn("startsWithA", $"data".startsWith("A"))
val resultDF2 = df.withColumn("data", lower($"data"))
withColumn
replaces the old column
if the same column
name is provided and creates a new column
if new column
name is provided.
Output:
+---+----+----+-----------+
|id |name|data|startsWithA|
+---+----+----+-----------+
|1 |Mary|abcd|true |
|2 |Joey|doge|false |
|3 |Lane|poop|false |
|4 |Jack|mega|false |
|5 |Lynn|argh|true |
+---+----+----+-----------+
Upvotes: 4