Reputation: 25366
I am using the following code to join two data frames:

new_df = df_1.join(df_2, on=['field_A', 'field_B', 'field_C'], how='left_outer')

The above code works fine, but sometimes df_1 and df_2 have hundreds of columns. Is it possible to join using the schema instead of manually adding all the columns? Or is there a way I can transform the schema into a list? Thanks a lot!
Upvotes: 2
Views: 459
Reputation: 2228
You can't join on a schema, if what you meant was somehow having join incorporate the column dtypes. What you can do is extract the column names first, then pass them as the list argument for on=, like this:
# Take every column name from df_1 as the join keys
join_cols = df_1.columns
df_1.join(df_2, on=join_cols, how='left_outer')
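Since the question also asks about turning the schema into a list: the schema object exposes the same names, so either of these is equivalent (df_1 comes from the question; this is just a sketch of the two routes):

# Equivalent ways to get a plain list of column names
join_cols = df_1.columns                          # shortcut property
join_cols = [f.name for f in df_1.schema.fields]  # via the schema object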
Now obviously you will have to edit the contents of join_cols to make sure it only has the names you actually want to join df_1 and df_2 on. But if there are hundreds of valid columns, that is probably much faster than adding them one by one. You could also make join_cols an intersection of the df_1 and df_2 columns, then edit from there if that's more suitable; see the sketch below.
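A minimal sketch of that intersection approach, assuming df_1 and df_2 are the PySpark DataFrames from the question (the column to drop is purely illustrative):

# Keep only the column names the two frames share, preserving df_1's order
join_cols = [c for c in df_1.columns if c in df_2.columns]

# Drop any shared names you don't actually want to join on, e.g.:
# join_cols.remove('some_non_key_column')

new_df = df_1.join(df_2, on=join_cols, how='left_outer')

Using a list comprehension rather than set(df_1.columns) & set(df_2.columns) keeps the join keys in a deterministic order, which makes the result easier to reason about.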
Edit: I should add that the Spark 2.0 release is literally any day now, and I haven't yet caught up on all the changes. So that might be worth looking into as well, or it may provide a future solution.
Upvotes: 2