replace one column values with another Spark Java

Question

I have a dataframe df1 of the format

+------+------+------+
| Col1 | Col2 | Col3 |
+------+------+------+
| A    | z    | m    |
| B    | w    | n    |
| C    | x    | o    |
| A    | z    | n    |
| A    | p    | o    |
+------+------+------+

and another dataframe df2 of the format

+------+------+
| Col1 | Col2 |
+------+------+
| 0-A  | 0-z  |
| 1-B  | 3-w  |
| 2-C  | 1-x  |
|      | 2-P  |
+------+------+-

I am trying to replace the values in Col1 and Col2 of df1 with values from df2 using Spark Java.

The end dataframe df3 should look like this.

+------+------+------+
| Col1 | Col2 | Col3 |
+------+------+------+
| 0-A  | 0-z  | m    |
| 1-B  | 3-w  | n    |
| 2-C  | 1-x  | o    |
| 0-A  | 0-z  | n    | 
| 0-A  | 2-p  | o    |
+------+------+------+

I am trying to replace all the values in the column1 and column2 of df1 with values from col1 and col2 of df2. Is there anyway that i can achieve this in Spark Java dataframe syntax.?

The initial idea i had was to do the following.

String pattern1="\p{L}+(?: \p{L}+)*$";

df1=df1.join(df2, df1.col("col1").equalTo(regexp_extract(df2.col("col1"),pattern1,1)),"left-semi");

Nikhil Suthar · Accepted Answer

Replace your last join operation with below join.

df1.alias("x").join(df2.alias("y").select(col("y.Col1").alias("newCol1")), col("x.Col1") === regexp_extract(col("newCol1"),"\p{L}+(?: \p{L}+)*$",0), "left")
              .withColumn("Col1", col("newCol1"))
              .join(df2.alias("z").select(col("z.Col2").alias("newCol2")), col("x.Col2") === regexp_extract(col("newCol2"),"\p{L}+(?: \p{L}+)*$",0), "left")
              .withColumn("Col2", col("newCol2"))
              .drop("newCol1", "newCol2")
              .show(false)

+----+----+----+
|Col1|Col2|Col3|
+----+----+----+
|2-C |1-x |o   |
|0-A |0-z |m   |
|0-A |0-z |n   |
|0-A |2-p |o   |
|1-B |3-w |n   |
+----+----+----+

replace one column values with another Spark Java

Answers (1)

Related Questions