Padfoot123
Padfoot123

Reputation: 1117

Java Spark - Difference of two column values in dataset/dataframe

I have the below spark dataset/dataframe. I have to create a new column diff_col by finding the difference between absolute values of col_2 and col_3

col_1 col_2 col_3
A     5      3
B     null  -2
C     2     null
D     null  null
E     3      1
F     4     -2

Expected output:

col_1 col_2 col_3 diff_col
A     5      3     2
B     null  -2    -2
C     2     null   2
D     null  null   0
E     3      1     2
F     4     -2     2

Not sure how to do this in java spark. In pyspark we can do by replacing null with lit(0) and then col(col_2) - col(col_3). Is there a java spark equivalent for this?

Upvotes: 0

Views: 1287

Answers (3)

andreoss
andreoss

Reputation: 1800

You can use coalese [1] and lit [2] since you have only two columns.

    private Column absoluteBetweenTwo(final Dataset<Row> ds) {
        final Column col_1 = functions.coalesce(ds.col("col_1"), functions.lit(0));
        final Column col_2 = functions.coalesce(ds.col("col_2"), functions.lit(0));
        return functions.abs(col_1.minus(col_2));
    }

And than add it as column

    ds.withColumn("col_3", this.absoluteBetweenTwo(ds));

The alternative solution is to wrap your Dataset explictitly in DataFrameNaFunctions [3], and use fill otherwise not acceptable from Java API.

fill will replace all NULL with provided value for columns of corresponding type.

    return new DataFrameNaFunctions(ds).fill(0L);

1: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#coalesce-org.apache.spark.sql.Column...-

2: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#lit-java.lang.Object-

3: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameNaFunctions.html

Upvotes: 1

s.polam
s.polam

Reputation: 10362

Use abs,nvl SQL functions inside expr function.

df
.withColumn(
    "diff_col",
    expr("(abs(nvl(col_2,0)) - abs(nvl(col_3,0)))")
)
.show(false)

Output

+-----+-----+-----+--------+
|col_1|col_2|col_3|diff_col|
+-----+-----+-----+--------+
|A    |5    |3    |2       |
|B    |null |-2   |-2      |
|C    |2    |null |2       |
|D    |null |null |0       |
|E    |3    |1    |2       |
|F    |4    |-2   |2       |
+-----+-----+-----+--------+

Upvotes: 0

Amita
Amita

Reputation: 974

You can try using fill method that creates another dataset replacing all null values with 0 (or any other numeric values) for specified columns. Then you can continue with calculating the difference.

Upvotes: 0

Related Questions