Reputation: 121
How can we merge two DataFrames without duplicate columns?
a.show()
+-----+-------------------+--------+------+
| Name| LastTime|Duration|Status|
+-----+-------------------+--------+------+
| Bob|2015-04-23 12:33:00| 1|logout|
|Alice|2015-04-20 12:33:00| 5| login|
+-----+-------------------+--------+------+
b.show()
+-----+-------------------+--------+------+
| Name| LastTime|Duration|Status|
+-----+-------------------+--------+------+
| Bob|2015-04-24 00:33:00| 1|login |
+-----+-------------------+--------+------+
I want to form a new DataFrame using all the data in DataFrame a, but updating rows with the data in b:
+-----+-------------------+--------+------+
| Name| LastTime|Duration|Status|
+-----+-------------------+--------+------+
| Bob|2015-04-24 00:33:00| 1|login |
|Alice|2015-04-20 12:33:00| 5| login|
+-----+-------------------+--------+------+
I am able to join and form the DataFrame in Scala, but I cannot do it in Java. This is what I tried:

DataFrame f = a.join(b,
        a.col("Name").equalTo(b.col("Name"))
            .and(a.col("LastTime").equalTo(b.col("LastTime")))
            .and(a.col("Duration").equalTo(b.col("Duration"))),
        "outer");

I am getting duplicate columns when I perform the join like this.
Upvotes: 4
Views: 4166
Reputation: 91
I think we can do this through Spark SQL, and it can be executed from Java as well.
spark.sql("""SELECT a.Name as Name,
CASE WHEN b.Name is null THEN a.LastTime ELSE b.LastTime END AS LastTime,
CASE WHEN b.Name is null THEN a.Duration ELSE b.Duration END AS Duration,
CASE WHEN b.Name is null THEN a.Status ELSE b.Status END AS Status
FROM a a left outer join b b on a.Name=b.Name
""").show(false)
+-----+-------------------+--------+------+
|Name |LastTime |Duration|Status|
+-----+-------------------+--------+------+
|Bob |2015-04-24 00:33:00|1 |login |
|Alice|2015-04-20 12:33:00|5 |login |
+-----+-------------------+--------+------+
You can update the join condition as per your use case.
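To run the same thing from Java, here is a minimal sketch (assuming a SparkSession named spark and the Datasets a and b from the question) that registers both as temporary views and submits the query as a plain string:

// Expose both Datasets to Spark SQL under the names used in the query (Spark 2.x API)
a.createOrReplaceTempView("a");
b.createOrReplaceTempView("b");

// Keep every row from a, but prefer b's values where a matching Name exists in b
Dataset<Row> result = spark.sql(
    "SELECT a.Name AS Name, "
        + "CASE WHEN b.Name IS NULL THEN a.LastTime ELSE b.LastTime END AS LastTime, "
        + "CASE WHEN b.Name IS NULL THEN a.Duration ELSE b.Duration END AS Duration, "
        + "CASE WHEN b.Name IS NULL THEN a.Status ELSE b.Status END AS Status "
        + "FROM a LEFT OUTER JOIN b ON a.Name = b.Name");
result.show(false);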
Upvotes: 1
Reputation: 51
The correct way (tested) is:
import java.util.Arrays;
import scala.collection.JavaConverters;

Dataset<Row> f = a.join(b,
        // Convert the Java List of column names to a Scala Seq
        JavaConverters.collectionAsScalaIterableConverter(
                Arrays.asList("Name", "LastTime", "Duration"))
            .asScala().toSeq(),
        "outer");
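Joining on a Seq of column names makes Spark keep a single copy of each join column (Name, LastTime, Duration) in the result, rather than one copy per side, which is what removes the duplicates.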
Upvotes: 1
Reputation: 161
You can perform a left semi join ("leftsemi") to avoid duplicate columns from the b dataset.
See here for reference: http://www.techburps.com/misc/apache-spark-dataset-joins-in-java/129
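For completeness, a minimal Java sketch of such a join (assuming the Datasets a and b from the question); note that a left semi join keeps only a's columns and returns the rows of a that have a match in b:

// Only a's columns come back, so nothing from b is duplicated
Dataset<Row> matched = a.join(b,
        a.col("Name").equalTo(b.col("Name")),
        "leftsemi");
matched.show();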
Upvotes: 0
Reputation: 334
According to this SO answer, a Seq of column names solves this in Scala.
Hence, converting a Java List to a Scala Seq should do it for you. Here is your corrected example code:
DataFrame f = a.join(b,
        // Convert Java List to Scala Seq
        scala.collection.JavaConverters.asScalaIteratorConverter(
                Arrays.asList("Name", "LastTime", "Duration").iterator()
        ).asScala().toSeq(),
        "outer");
Upvotes: 2