Reputation: 25
I'm working on a script using PySpark and Spark SQL. After running a LEFT JOIN on my base_df and inc_df dataframes, all of my columns were duplicated. I've figured out why the columns were duplicated, but I'm now receiving a TypeError while trying to drop() those duplicated columns.
I've also tried the suggested solution of joining on the column name, a.join(b, 'id').
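If I understand it correctly, joining on the key names instead of an expression keeps only one copy of the join columns, though the other columns shared by both dataframes still come through twice. A sketch of that mapped onto my two key columns:

# Sketch: joining on a list of key names dedupes the join keys only;
# columns like region_cd that exist on both sides are still duplicated.
uc_df = base_df.join(inc_df, ["src_sys_id", "fleet_acct_nbr"], "left_outer")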
Here's the relevant part of my code:
uc_df = (base_df
         .join(inc_df,
               (inc_df.src_sys_id == base_df.src_sys_id) & (inc_df.fleet_acct_nbr == base_df.fleet_acct_nbr),
               "left_outer")
         .filter(inc_df.src_sys_id.isNull())
         .drop(base_df.region_cd,
               base_df.fleet_acct_exctv_cd,
               base_df.fleet_acct_nbr,
               base_df.fleet_acct_exctv_nm,
               base_df.fleet_acct_exctv_first_nm,
               base_df.fleet_acct_exctv_last_nm,
               base_df.fleet_acct_exctv_mid_nm,
               base_df.fleet_acct_nm,
               base_df.meas_cnt,
               base_df.dw_anom_flg,
               base_df.dw_mod_ts,
               base_df.src_sys_iud_cd,
               base_df.dw_job_id)
         )
I received the following error:
TypeError: Each column in the param list should be a string
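From what I can tell, drop() only takes multiple columns when they are passed as string names; a single Column object is accepted, but several Column objects at once trigger this error. A sketch of a workaround under that assumption, chaining one single-Column drop() call per duplicated column (joined_df here stands for the join/filter result before the failing drop):

joined_df = (base_df
             .join(inc_df,
                   (inc_df.src_sys_id == base_df.src_sys_id)
                   & (inc_df.fleet_acct_nbr == base_df.fleet_acct_nbr),
                   "left_outer")
             .filter(inc_df.src_sys_id.isNull()))

# One Column object per drop() call avoids the strings-only restriction
# that applies when several columns are passed at once:
uc_df = (joined_df
         .drop(base_df.region_cd)
         .drop(base_df.fleet_acct_nbr)
         .drop(base_df.fleet_acct_nm))
# ...and so on for the remaining duplicated base_df columns.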
Upvotes: 1
Views: 356
Reputation: 1739
You don't need to drop anything. Instead, you can alias the dataframes and use select, as below:
df1.alias("a").join(df2.alias("b"), df1["key"] == df2["key"], 'inner').select("a.*").show()
Replace the names and join type with your own.
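Here's a minimal, self-contained sketch of the pattern with made-up dataframes (key and val are placeholder column names):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["key", "val"])
df2 = spark.createDataFrame([(1, "p")], ["key", "val"])

# Alias each side, join on the key, then select only the left side's
# columns -- no duplicates survive, so there is nothing to drop:
result = (df1.alias("a")
          .join(df2.alias("b"), df1["key"] == df2["key"], "inner")
          .select("a.*"))
result.show()

In your case that would be base_df.alias("a") and inc_df.alias("b") with a "left_outer" join, selecting the alias whose columns you want to keep.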
Upvotes: 1