Edamame

Reputation: 25366

pyspark: join using schema? Or converting the schema to a list?

I am using the following code to join two data frames:

new_df = df_1.join(df_2, on=['field_A', 'field_B', 'field_C'], how='left_outer')

The above code works fine, but sometimes df_1 and df_2 have hundreds of columns. Is it possible to join using the schema instead of manually adding all the columns? Or is there a way that I can transform the schema into a list? Thanks a lot!

Upvotes: 2

Views: 459

Answers (1)

Jeff

Reputation: 2228

You can't join on a schema, if what you mean is somehow having join incorporate the column dtypes. What you can do is extract the column names first, then pass them as the list argument to on=, like this:

join_cols = df_1.columns
df_1.join(df_2, on=join_cols, how='left_outer')

Now obviously you will have to edit the contents of join_cols to make sure it only contains the names you actually want to join df_1 and df_2 on. But if there are hundreds of valid columns, that is probably much faster than adding them one by one. You could also make join_cols an intersection of the df_1 and df_2 columns, then edit from there if that's more suitable.
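As a minimal sketch of that intersection idea (assuming df_1 and df_2 are the two data frames from the question):

# keep only the column names that appear in both data frames
join_cols = [c for c in df_1.columns if c in df_2.columns]

new_df = df_1.join(df_2, on=join_cols, how='left_outer')

Building the list this way preserves the column order of df_1, and you can still drop any shared columns you don't actually want to join on before calling join.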

Edit: I should add that the Spark 2.0 release is literally any day now, and I haven't versed myself in all the changes yet. So that might be worth looking into as well, or it may provide a future solution.

Upvotes: 2
