nee nee

Reputation: 89

How to update the value of a column in Spark dataset using java?

I have loaded a dataset using:

Dataset<Row> rows = sparkSession.read().format("com.databricks.spark.csv").option("header", "true").load(tablenameAndLocationMap.get(tablename));

The data loads correctly, but I want to update a column value at runtime. I tried looping over the rows as shown below, but it didn't work.

Column data = rows.col("UPLOADED_ON");
Dataset<Row> d = rows.select(data);
            
d.foreach(obj->{
    String date = obj.getAs(0);
    DateFormat inputFormatter = new SimpleDateFormat("yyyy-MM-dd");
    Date da = (Date)inputFormatter.parse(date);
    
    DateFormat outputFormatter = new SimpleDateFormat("dd-MM-yy");
    date = outputFormatter.format(da);
});

Here I want to replace the existing value of the UPLOADED_ON column with the new value of date.

How can this be done? Any help would be appreciated.

Thanks

Upvotes: 2

Views: 1100

Answers (1)

Dilermando Lima

Reputation: 1180

You could create another column with different values and remove the previous one.

// requires: import static org.apache.spark.sql.functions.lit;
// create a new column
yourdataset = yourdataset.withColumn("UPLOADED_ON_NEW", lit("Any-value"));
// drop the old column
yourdataset = yourdataset.drop("UPLOADED_ON");

In your case, I suggest creating a UDF that receives a date and returns it in the format you need.

Example of registering a function on the sparkSession so that it can be used in any dataset transformation:

sparkSession.udf().register(
   "formatDateYYYYMMDDtoDDMMYY", // name of the function
   (String dateIn) -> { ... }, // conversion logic
   DataTypes.StringType // return type
);
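The lambda body is elided in the registration above. As a minimal sketch of what the conversion logic might look like, the helper below turns a "yyyy-MM-dd" string into "dd-MM-yy" using only the JDK. The class name DateReformat is hypothetical, and it uses java.time instead of the SimpleDateFormat from the question because DateTimeFormatter is immutable and thread-safe. Treating a null input as null output is also an assumption.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper class; not part of Spark or of the original answer.
final class DateReformat {
    // Input pattern taken from the question, e.g. "2019-03-07".
    private static final DateTimeFormatter INPUT = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    // Output pattern taken from the question, e.g. "07-03-19".
    private static final DateTimeFormatter OUTPUT = DateTimeFormatter.ofPattern("dd-MM-yy");

    private DateReformat() {
    }

    // Converts "yyyy-MM-dd" to "dd-MM-yy".
    // Assumption: a null cell should stay null rather than raise an error.
    // A malformed date throws java.time.format.DateTimeParseException.
    static String formatDateYYYYMMDDtoDDMMYY(String dateIn) {
        if (dateIn == null) {
            return null;
        }
        return LocalDate.parse(dateIn.trim(), INPUT).format(OUTPUT);
    }
}
```

The lambda passed to register could then delegate to it, for example (String dateIn) -> DateReformat.formatDateYYYYMMDDtoDDMMYY(dateIn).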

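Using the registered function:

// requires: import static org.apache.spark.sql.functions.callUDF;
// requires: import static org.apache.spark.sql.functions.col;
yourdataset = yourdataset.withColumn(
  "UPLOADED_ON_NEW",
  callUDF(
     "formatDateYYYYMMDDtoDDMMYY", // same name as the registered function
     col("UPLOADED_ON")
  )
);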

A registered UDF can also be called from SQL queries through sqlContext:

yourdataset.createOrReplaceTempView("MY_DATASET");

yourdataset = sparkSession.sqlContext().sql(
  "select *, formatDateYYYYMMDDtoDDMMYY(UPLOADED_ON) as UPLOADED_ON_NEW from MY_DATASET");

Upvotes: 2
