Joining multiple DataFrames using SparkR

Question

I have a DataFrame with Person data and about 20 more DataFrames that share a common key, Person_Id. I want to join all of them to the Person DataFrame so that all my data ends up in a single DataFrame.

I tried both join and merge like this:

merge(df_person, df_1, by="Person_Id", all.x=TRUE)

and

join(df_person, df_1, df_person$Person_Id == df_1$Person_Id, "left")

Both give me the same problem: the DataFrames are joined correctly, but the Person_Id column is duplicated in the result. Is there any way to tell these functions not to duplicate the Person_Id column?

Also, does anyone know a more efficient way to join all those DataFrames together?

Thank you so much for your help in advance.

Alper t. Turker · Accepted Answer

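The other Spark language APIs (for example Scala and Python) support a simplified equi-join syntax that takes the key column name and keeps a single copy of it, but it looks like this is not implemented in SparkR, so you have to do it the old way (rename and drop):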

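library(SparkR)
library(magrittr)

# Rename the key on one side so the join condition is unambiguous,
# then drop the renamed copy once the join is done.
withColumnRenamed(df_1, "Person_Id", "Person_Id_") %>%
  join(df_2, column("Person_Id") == column("Person_Id_")) %>%
  drop("Person_Id_")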
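
For the second part of the question (joining around 20 DataFrames), one possible approach is to wrap the rename/join/drop steps in a function and fold it over a list with base R's Reduce. This is only a sketch under some assumptions: left_join_on_key is a hypothetical helper name, df_2 and df_3 stand in for the remaining DataFrames, the temporary column name Person_Id_ is assumed not to exist in any input, and the non-key column names are assumed not to clash between DataFrames. It makes the code shorter; it is not claimed to make the joins themselves run faster.

```r
library(SparkR)

# Hypothetical helper: left-join `right` onto `left` on a shared key column
# and keep only one copy of that key in the result.
left_join_on_key <- function(left, right, key = "Person_Id") {
  tmp_key <- paste0(key, "_")  # temporary name, assumed unused elsewhere
  right_renamed <- withColumnRenamed(right, key, tmp_key)
  joined <- join(left, right_renamed,
                 left[[key]] == right_renamed[[tmp_key]], "left")
  drop(joined, tmp_key)
}

# Placeholder list: put all the DataFrames to attach to df_person here.
others <- list(df_1, df_2, df_3)

# Start from df_person and fold each DataFrame in, one left join at a time.
df_all <- Reduce(left_join_on_key, others, df_person)
```

The "left" join type matches the all.x=TRUE / "left" behaviour from the question, so every row of df_person is kept even when a DataFrame has no matching Person_Id.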
