Gavin Niu
Gavin Niu

Reputation: 1335

Merging rows from different dataframes together in Scala

For example first I have a dataframe like this

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+

we have years 2012, 1997 and 2015. And we have another Dataframe like this

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|BMW  |    3|          No comment|     |
|1997|VW   | GTI |   get              |     |
|2015|MB   | C200|                good| null|
+----+-----+-----+--------------------+-----+

we also have year 2012, 1997, 2015. How can we merge the rows with same year together? Thanks

The output should be like this

+----+-----+-----+--------------------+-----++-----+-----+--------------------------+
|year| make|model|             comment|blank|| make|model|             comment|blank|
+----+-----+-----+--------------------+-----++-----+-----+-----+--------------------+
|2012|Tesla|    S|          No comment|     |BMW   | 3   |          no comment|
|1997| Ford| E350|Go get one now th...|     |VW    |GTI  |      get           |
|2015|Chevy| Volt|                null| null|MB    |C200 |             Good   |null
+----+-----+-----+--------------------+-----++----+-----+-----+---------------------+

Upvotes: 0

Views: 658

Answers (1)

Rohan Aletty
Rohan Aletty

Reputation: 2432

You can get what your desired table with a simple join. Something like:

val joined = df1.join(df2, df1("year") === df2("year"))

I loaded your inputs such that I see the following:

scala> df1.show
...
year make  model comment
2012 Tesla S     No comment
1997 Ford  E350  Go get one now
2015 Chevy Volt  null

scala> df2.show
...
year make model comment
2012 BMW  3     No comment
1997 VW   GTI   get
2015 MB   C200  good

When I run the join, I get:

scala> val joined = df1.join(df2, df1("year") === df2("year"))
joined: org.apache.spark.sql.DataFrame = [year: string, make: string, model: string, comment: string, year: string, make: string, model: string, comment: string]

scala> joined.show
...
year make  model comment        year make model comment
2012 Tesla S     No comment     2012 BMW  3     No comment
2015 Chevy Volt  null           2015 MB   C200  good
1997 Ford  E350  Go get one now 1997 VW   GTI   get

One thing to note is that your column names may be ambiguous as they are named the same across dataframes (so you could change their names to make operations on your resultant dataframe easier to write).

Upvotes: 1

Related Questions