Reputation: 89
I have loaded a dataset using:
Dataset<Row> rows = sparkSession.read().format("com.databricks.spark.csv").option("header", "true").load(tablenameAndLocationMap.get(tablename));
The data loads correctly, but I want to update a column's value at runtime. I tried looping over it as shown below, but it didn't work.
Column data = rows.col("UPLOADED_ON");
Dataset<Row> d = rows.select(data);
d.foreach(obj->{
String date = obj.getAs(0);
DateFormat inputFormatter = new SimpleDateFormat("yyyy-MM-dd");
Date da = inputFormatter.parse(date);
DateFormat outputFormatter = new SimpleDateFormat("dd-MM-yy");
date = outputFormatter.format(da);
});
Here I want to replace/update the existing value of the column UPLOADED_ON with the new value of date. How can this be done? Any help would be appreciated.
Thanks
Upvotes: 2
Views: 1100
Reputation: 1180
You could create another column with the converted values and then drop the original one.
// create a new column
yourdataset = yourdataset.withColumn("UPLOADED_ON_NEW", lit("Any-value"));
// drop a column
yourdataset = yourdataset.drop("UPLOADED_ON");
In your case I suggest creating a UDF that receives a date and returns it in the specific format you need.
Example of registering a function on the sparkSession so it can be used in all dataset transformations:
context.sparkSession().udf().register(
"formatDateYYYYMMDDtoDDMMYY", // name of function
(String dateIn) -> { ... }, // all convert rules
DataTypes.StringType // return type
);
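For reference, a minimal sketch of what the "convert rules" inside that lambda might look like, assuming the input is always in yyyy-MM-dd form. It uses java.time rather than SimpleDateFormat to avoid the checked ParseException inside the lambda; the helper class name and null handling are illustrative:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateConvert {

    // Converts a yyyy-MM-dd string to dd-MM-yy; propagates null input as null.
    static String formatDateYYYYMMDDtoDDMMYY(String dateIn) {
        if (dateIn == null) {
            return null; // keep nulls as nulls instead of throwing
        }
        LocalDate d = LocalDate.parse(dateIn, DateTimeFormatter.ofPattern("yyyy-MM-dd"));
        return d.format(DateTimeFormatter.ofPattern("dd-MM-yy"));
    }

    public static void main(String[] args) {
        System.out.println(formatDateYYYYMMDDtoDDMMYY("2019-03-15")); // prints 15-03-19
    }
}
```

The body of formatDateYYYYMMDDtoDDMMYY is what would go in place of { ... } in the registration above.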
Using the created function:
yourdataset =
yourdataset.withColumn(
"UPLOADED_ON_NEW",
callUDF(
"formatDateYYYYMMDDtoDDMMYY", // same name of create function
col("UPLOADED_ON")
)
);
It's possible to use UDF functions through the sqlContext as well:
yourdataset.createOrReplaceTempView("MY_DATASET");
yourdataset =
sparkSession.sqlContext().sql("select * , formatDateYYYYMMDDtoDDMMYY(UPLOADED_ON) as UPLOADED_ON_NEW from MY_DATASET");
Upvotes: 2