Atihska

Reputation: 5126

Unable to understand UDFs in Spark and especially in Java

I am trying to create a new column in a Spark Dataset based on another column's value. The other column's value is looked up as a key in a JSON file, and the value returned for that key is the value to use for the new column.

Here is the code that I tried, but it doesn't work, and I am not sure how UDFs work either. How do you add a column in this case using withColumn or a udf?

    Dataset<Row> df = spark.read().format("csv").option("header", "true").load("file path");

    // json-simple: parse the lookup file into a JSONObject
    Object obj = new JSONParser().parse(new FileReader("json path"));
    JSONObject jo = (JSONObject) obj;

    // Attempt to look up the existing column's value and add it as a new column
    df = df.withColumn("cluster", functions.lit(jo.get(df.col("existing col_name"))));

Any help will be appreciated. Thanks in advance!

Upvotes: 0

Views: 441

Answers (2)

Atihska

Reputation: 5126

Thanks @Constantine. I was able to better understand UDFs from your example. Here is my Java code:

    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    // json-simple: parse the lookup file into a JSONObject
    Object obj = new JSONParser().parse(new FileReader("json path"));
    JSONObject jo = (JSONObject) obj;

    // Register a UDF that uses the first five characters of the input
    // as the key and returns the matching value from the JSON file
    spark.udf().register("getJsonVal", new UDF1<String, String>() {
        @Override
        public String call(String key) {
            return (String) jo.get(key.substring(0, 5));
        }
    }, DataTypes.StringType);

    df = df.withColumn("cluster", functions.callUDF("getJsonVal", df.col("existing col_name")));
    df.show(); // shows the new "cluster" column
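
Since UDF1 has a single abstract method, the same registration can also be written with a lambda on Spark 2.x. A minimal sketch of the equivalent registration (the cast to UDF1 is needed because register() is overloaded):

    // Equivalent registration using a lambda instead of an anonymous class
    spark.udf().register("getJsonVal",
            (UDF1<String, String>) key -> (String) jo.get(key.substring(0, 5)),
            DataTypes.StringType);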

Upvotes: 0

Constantine

Reputation: 1416

Spark allows you to create custom user-defined functions (UDFs) with the udf function.

The following is a Scala snippet showing how to define a UDF.

    // json-simple: parse the lookup file into a JSONObject
    val obj = new JSONParser().parse(new FileReader("json path"))
    val jo = obj.asInstanceOf[JSONObject]

    // Plain Scala function that looks up a key in the parsed JSON
    def getJSONObject(key: String) = {
      jo.get(key)
    }

Once you have defined your function, you can convert it to a UDF as:

    val getObject = udf(getJSONObject _)

There are two approaches to using a UDF.

  1. df.withColumn("cluster", getObject(col("existing_col_name")))

  2. If you are using Spark SQL, you have to register your UDF with the sqlContext before using it.

    spark.sqlContext.udf.register("get_object", getJSONObject _)

    And then you can use it as

    spark.sql("select get_object(existing_column) from some_table")

Out of these, which one to use is completely subjective. A Java version of the second approach is sketched below.
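
For the Java API (which the question uses), the second approach looks roughly like this; a minimal sketch reusing the jo lookup object from the question and the some_table / existing_column names from above:

    // Register the UDF with Spark SQL, then call it from a SQL query
    spark.udf().register("get_object",
            (UDF1<String, String>) key -> (String) jo.get(key),
            DataTypes.StringType);

    // Expose the DataFrame to SQL under the view name used in the query
    df.createOrReplaceTempView("some_table");
    Dataset<Row> result = spark.sql("select get_object(existing_column) from some_table");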

Upvotes: 2
