Reputation: 177
I have a DataFrame with Person data, plus about 20 more DataFrames that share a common key, Person_Id. I want to join all of them to the Person DataFrame so that all my data ends up in a single DataFrame.
I tried both join and merge like this:
merge(df_person, df_1, by="Person_Id", all.x=TRUE)
and
join(df_person, df_1, df_person$Person_Id == df_1$Person_Id, "left")
In both cases I run into the same problem: the functions join the datasets correctly, but the result contains the Person_Id field twice. Is there any way to tell these functions not to duplicate the Person_Id field?
Also, does anyone know a more efficient way to join all those DataFrames together?
Thank you so much for your help in advance.
Upvotes: 0
Views: 843
Reputation: 223
If you're doing a lot of joins in SparkR, it is worthwhile to write your own function that renames the key, joins, and then removes the renamed column:
DFJoin <- function(left_df, right_df, key = "key", join_type = "left"){
  # Give the key a distinct name on each side so the join condition is unambiguous
  left_df <- withColumnRenamed(left_df, key, "left_key")
  right_df <- withColumnRenamed(right_df, key, "right_key")
  result <- join(
    left_df, right_df,
    left_df$left_key == right_df$right_key,
    joinType = join_type)
  # Restore the original key name and drop the duplicate from the right side
  result <- withColumnRenamed(result, "left_key", key)
  result$right_key <- NULL
  return(result)
}
df1 <- as.DataFrame(data.frame(Person_Id = c("1", "2", "3"), value_1 = c(2, 4, 6)))
df2 <- as.DataFrame(data.frame(Person_Id = c("1", "2"), value_2 = c(3, 6)))
df3 <- DFJoin(df1, df2, key = "Person_Id", join_type = "left")
head(df3)
  Person_Id value_1 value_2
1         3       6      NA
2         1       2       3
3         2       4       6
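On the second part of the question (joining around 20 DataFrames), one option is to keep the DataFrames in a list and fold over it with base R's Reduce, applying a rename-join-drop helper at each step. This is a minimal sketch, assuming a Spark session can be started locally. The helper join_on_key, the name column, and the sample data are made up for illustration, and the temporary column name Person_Id_right is assumed not to clash with any existing column:

```r
library(SparkR)
sparkR.session()

# Hypothetical helper: left-join right_df onto left_df on `key`,
# keeping only one copy of the key column.
join_on_key <- function(left_df, right_df, key) {
  tmp_key <- paste0(key, "_right")  # temporary name, assumed to be unused
  right_df <- withColumnRenamed(right_df, key, tmp_key)
  joined <- join(left_df, right_df, left_df[[key]] == right_df[[tmp_key]], "left")
  drop(joined, tmp_key)
}

# Made-up sample data standing in for df_person and the other DataFrames
df_person <- as.DataFrame(data.frame(Person_Id = c("1", "2", "3"),
                                     name = c("a", "b", "c"),
                                     stringsAsFactors = FALSE))
df_1 <- as.DataFrame(data.frame(Person_Id = c("1", "2", "3"),
                                value_1 = c(2, 4, 6),
                                stringsAsFactors = FALSE))
df_2 <- as.DataFrame(data.frame(Person_Id = c("1", "2"),
                                value_2 = c(3, 6),
                                stringsAsFactors = FALSE))

# Fold the list into df_person, one left join per DataFrame
others <- list(df_1, df_2)
df_all <- Reduce(function(acc, df) join_on_key(acc, df, "Person_Id"),
                 others, df_person)

columns(df_all)
```

This does not make the individual joins any cheaper; it only avoids writing 20 near-identical calls by hand.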
Upvotes: 1
Reputation: 35249
The other languages Spark supports offer a simplified equi-join syntax, but it looks like it is not implemented in R, so you have to do it the old way (rename and drop):
library(magrittr)
withColumnRenamed(df_1, "Person_Id", "Person_Id_") %>%
  join(df_2, column("Person_Id") == column("Person_Id_")) %>%
  drop("Person_Id_")
Upvotes: 1