Reputation: 539
For example, I want to replace all numbers equal to 0.2 in a column with 0. How can I do that in Scala? Thanks
Edit:
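+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+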
This is my DataFrame. I'm trying to change Tesla in the make column to S.
Upvotes: 38
Views: 115832
Reputation: 301
Building off the solution from @Azeroth2b: if you want to replace only a couple of items and leave the rest unchanged, do the following. Without the otherwise(...) method, every value that does not match the condition becomes null.
import org.apache.spark.sql.functions._
val newsdf =
  sdf.withColumn(
    "make",
    when(col("make") === "Tesla", "S").otherwise(col("make"))
  )
Old DataFrame
+-----+-----+
| make|model|
+-----+-----+
|Tesla|    S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
New DataFrame
+-----+-----+
| make|model|
+-----+-----+
|    S|    S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
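The numeric case from the question (replacing 0.2 with 0 in a column) follows the same when/otherwise pattern. Below is a minimal sketch, assuming Spark 2.x or later with a local SparkSession; the scores DataFrame and its id and score columns are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local SparkSession (Spark 2.x+); on Spark 1.x use SQLContext instead.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("when-otherwise-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: "score" is a Double column that contains some 0.2 values.
val scores = Seq(("a", 0.2), ("b", 0.5), ("c", 0.2)).toDF("id", "score")

// Replace 0.2 with 0.0; otherwise(...) keeps every other value unchanged
// instead of turning it into null.
val cleaned = scores.withColumn(
  "score",
  when(col("score") === 0.2, 0.0).otherwise(col("score"))
)

cleaned.show()
```

Note that this is an exact comparison on doubles, so values that only approximately equal 0.2 (for example, results of earlier arithmetic) would not match.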
Upvotes: 30
Reputation: 21
import org.apache.spark.sql.functions._
// Read the CSV, drop rows with no CPF, then strip "." and "-" from cpf into CARD_KEY.
val base_optin_email = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .schema(schema_base_optin)
  .csv(file_optin_email)
  .where("CPF IS NOT NULL")
  .withColumn("CARD_KEY", translate(translate(col("cpf"), ".", ""), "-", ""))
Upvotes: 0
Reputation: 31
df2.na.replace("Name",Map("John" -> "Akshay","Cindy" -> "Jayita")).show()
The signature of replace in class DataFrameNaFunctions is [T](col: String, replacement: Map[T,T])org.apache.spark.sql.DataFrame
To run this function you need an active Spark object and a DataFrame with headers on.
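The same na.replace call also accepts numeric keys, which covers the 0.2-to-0 case from the question. Below is a minimal sketch, assuming Spark 2.x or later with a local SparkSession; the readings DataFrame and its id and value columns are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession (Spark 2.x+).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("na-replace-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data with a Double column named "value".
val readings = Seq(("a", 0.2), ("b", 0.5)).toDF("id", "value")

// Map each old value to its replacement; values not in the map are left as they are.
val replaced = readings.na.replace("value", Map(0.2 -> 0.0))

replaced.show()
```

Because the map's keys and values must share one type, this form suits like-for-like substitutions; conditional logic is easier to express with when/otherwise.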
Upvotes: 3
Reputation: 638
Spark 1.6.2, Java code (sorry), this will change every instance of Tesla to S for the entire dataframe without passing through an RDD:
// Requires: import static org.apache.spark.sql.functions.*;
dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
    .otherwise(col("make")));
Edited to add @marshall245's "otherwise" to ensure non-Tesla values aren't converted to NULL.
Upvotes: 49
Reputation: 49410
Note:
As mentioned by Olivier Girardot, this answer is not optimized, and the withColumn solution is the one to use (Azeroth2b's answer).
I cannot delete this answer because it has been accepted.
Here is my take on this one:
import org.apache.spark.sql.{Row, SQLContext}

val rdd = sc.parallelize(
  List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt"))
)
val sqlContext = new SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
val dataframe = rdd.toDF()
dataframe.foreach(println)
// Rebuild each row, swapping the make (column 1) when it is "tesla" in any letter case.
dataframe.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
}).collect().foreach(println)
//[2012,S,S]
//[1997,Ford,E350]
//[2015,Chevy,Volt]
You can actually use map directly on the DataFrame.
So you basically check column 1 for the String tesla. If it's tesla, use the value S for make; otherwise use the current value of column 1.
Then build a tuple with all the data from the row using the (zero-based) indexes (Row(row(0),make,row(2)) in my example).
There is probably a better way to do it. I am not that familiar yet with the Spark umbrella
Upvotes: 13
Reputation: 557
This can be achieved in dataframes with user defined functions (udf).
import org.apache.spark.sql.functions._
val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
  """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
)))
val makeSIfTesla = udf { (make: String) =>
  if (make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show
Upvotes: 15