Reputation: 640
I decided to create this question because the answer provided to a very similar problem does not work for my scenario:
I want to merge these two SparkR dataframes df1
col1 col2
1 11
2 22
3 33
4 44
and df2
col1 col2 newcol
1 11 401
2 22 402
3 33 403
4 44 404
The dfs should be merged on columns 1 and 2. I tried all steps proposed in the link above (e.g. omitting "by" statement), but the result always looks like this:
col1_x col2_x col1_y col2_y newcol
1 11 1 11 401
2 22 2 22 402
3 33 3 33 403
4 44 4 44 404
I don't want to have duplicate columns and for now I have to delete col1_y, col2_y manually and rename col1_x, col2_x to only have col1, col2 and newcol as result:
df <- merge(df1, df2) # <- how can I change this?
df$col1_y <- NULL
df$col2_y <- NULL
df <- withColumnRenamed(df, "col1_x", "col1")
df <- withColumnRenamed(df, "col2_x", "col2")
I tried all kinds of combinations using the by, by.x and by.y statement but none worked. Is there a way I can simplify this tedious correction chain by omitting one of the steps? Thank you!
Upvotes: 3
Views: 928
Reputation: 24188
This is expected behaviour, one option is to select()
relevant columns from the resulting DataFrame
:
library(magrittr)
# We'll use join to avoid suffixes '_x' & '_y'
join(df1,df2,(df1$col1==df2$col1)&(df1$col2==df2$col2)) %>%
select(df1$col1,df1$col2,df2$newcol) %>%
showDF()
#+----+----+------+
#|col1|col2|newcol|
#+----+----+------+
#| 1|11.0| 401|
#| 2|22.0| 402|
#| 3|33.0| 403|
#| 4|55.0| 404|
#+----+----+------+
Upvotes: 1