Shebang_John

Reputation: 121

How to join Spark dataframes without duplicate columns in Java

How can we merge two dataframes without duplicate columns?

a.show()

+-----+-------------------+--------+------+
| Name|           LastTime|Duration|Status|
+-----+-------------------+--------+------+
|  Bob|2015-04-23 12:33:00|       1|logout|
|Alice|2015-04-20 12:33:00|       5| login|
+-----+-------------------+--------+------+

b.show()
+-----+-------------------+--------+------+
| Name|           LastTime|Duration|Status|
+-----+-------------------+--------+------+
|  Bob|2015-04-24 00:33:00|       1|login |
+-----+-------------------+--------+------+

I want to form a new dataframe using all the data in dataframe a, but update its rows with the data in b:

+-----+-------------------+--------+------+
| Name|           LastTime|Duration|Status|
+-----+-------------------+--------+------+
|  Bob|2015-04-24 00:33:00|       1|login |
|Alice|2015-04-20 12:33:00|       5| login|
+-----+-------------------+--------+------+

I am able to join and form the dataframe in Scala, but not in Java.

DataFrame f = a.join(b,
    a.col("Name").equalTo(b.col("Name"))
        .and(a.col("LastTime").equalTo(b.col("LastTime")))
        .and(a.col("Duration").equalTo(b.col("Duration"))),
    "outer");

I am getting duplicate columns when performing a join like this.

Upvotes: 4

Views: 4166

Answers (4)

viapak

Reputation: 91

I think we can try it through Spark SQL, and it can be executed from Java as well.

spark.sql("""SELECT a.Name AS Name,
  CASE WHEN b.Name IS NULL THEN a.LastTime ELSE b.LastTime END AS LastTime,
  CASE WHEN b.Name IS NULL THEN a.Duration ELSE b.Duration END AS Duration,
  CASE WHEN b.Name IS NULL THEN a.Status ELSE b.Status END AS Status
FROM a a LEFT OUTER JOIN b b ON a.Name = b.Name
""").show(false)

+-----+-------------------+--------+------+
|Name |LastTime           |Duration|Status|
+-----+-------------------+--------+------+
|Bob  |2015-04-24 00:33:00|1       |login |
|Alice|2015-04-20 12:33:00|5       |login |
+-----+-------------------+--------+------+

One can update the join condition as per the use case.
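The same query can be run from Java. A minimal self-contained sketch (assuming Spark is on the classpath; the class name, local master, and the string-typed `LastTime` column are illustrative choices, not from the original post; Java before version 15 has no triple-quoted strings, so the query is concatenated):

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class UpdateViaSql {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[1]").appName("update-via-sql").getOrCreate();

        // Recreate the question's two dataframes.
        StructType schema = new StructType()
            .add("Name", DataTypes.StringType)
            .add("LastTime", DataTypes.StringType)
            .add("Duration", DataTypes.IntegerType)
            .add("Status", DataTypes.StringType);
        Dataset<Row> a = spark.createDataFrame(Arrays.asList(
            RowFactory.create("Bob", "2015-04-23 12:33:00", 1, "logout"),
            RowFactory.create("Alice", "2015-04-20 12:33:00", 5, "login")), schema);
        Dataset<Row> b = spark.createDataFrame(Arrays.asList(
            RowFactory.create("Bob", "2015-04-24 00:33:00", 1, "login")), schema);

        // Register temp views so the SQL above can refer to them by name.
        a.createOrReplaceTempView("a");
        b.createOrReplaceTempView("b");

        // For each row of a, prefer b's values when a match on Name exists.
        Dataset<Row> f = spark.sql(
            "SELECT a.Name AS Name, "
            + "CASE WHEN b.Name IS NULL THEN a.LastTime ELSE b.LastTime END AS LastTime, "
            + "CASE WHEN b.Name IS NULL THEN a.Duration ELSE b.Duration END AS Duration, "
            + "CASE WHEN b.Name IS NULL THEN a.Status ELSE b.Status END AS Status "
            + "FROM a a LEFT OUTER JOIN b b ON a.Name = b.Name");
        f.show(false);
        spark.stop();
    }
}
```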

Upvotes: 1

Renats Razumilovs

Reputation: 51

The correct way is (tested):

Dataset<Row> f = a.join(b,
    // Convert Java List to Scala Seq
    JavaConverters.collectionAsScalaIterableConverter(
        Arrays.asList("Name", "LastTime", "Duration"))
        .asScala().toSeq(),
    "outer");

Upvotes: 1

NiharGht

Reputation: 161

You can perform a left semi join ("leftsemi") to avoid duplicate columns from the b dataset.

See here for reference: http://www.techburps.com/misc/apache-spark-dataset-joins-in-java/129
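For illustration, a left semi join keeps only the left dataset's columns (and only its rows that have a match on the right), so duplicate columns cannot appear; note this does not pull updated values from b. A minimal sketch, with a reduced two-column schema and class name chosen for the example (not from the original post):

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class LeftSemiSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[1]").appName("leftsemi-sketch").getOrCreate();

        StructType schema = new StructType()
            .add("Name", DataTypes.StringType)
            .add("Status", DataTypes.StringType);
        Dataset<Row> a = spark.createDataFrame(Arrays.asList(
            RowFactory.create("Bob", "logout"),
            RowFactory.create("Alice", "login")), schema);
        Dataset<Row> b = spark.createDataFrame(Arrays.asList(
            RowFactory.create("Bob", "login")), schema);

        // "leftsemi" returns only a's columns, restricted to a's rows
        // that have a matching Name in b (here: only Bob, with a's values).
        Dataset<Row> f = a.join(b, a.col("Name").equalTo(b.col("Name")), "leftsemi");
        f.show();
        spark.stop();
    }
}
```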

Upvotes: 0

Christoph Körner

Reputation: 334

According to this SO answer, a Seq of column names solves this in Scala.

Hence, converting a Java List to a Scala Seq should do it for you. Here is your corrected example code:

DataFrame f = a.join(b,
    // Convert Java List to Scala Seq
    scala.collection.JavaConverters.asScalaIteratorConverter(
        Arrays.asList("Name", "LastTime", "Duration").iterator()
    ).asScala().toSeq(),
    "outer");

Upvotes: 2
