GreenBlazerrr21

Reputation: 25

Drop multiple duplicated columns after left join w/ dataframes?

I'm working on a script using PySpark and Spark SQL. However, after running a LEFT JOIN on my base_df and inc_df dataframes, all of my columns were duplicated.

I've figured out why the columns were duplicated, but I'm now getting a TypeError while trying to drop those duplicated columns.

I've also tried this solution: a.join(b, 'id')

Here's part of my code:


uc_df = (base_df
         .join(inc_df,
               (inc_df.src_sys_id == base_df.src_sys_id) & (inc_df.fleet_acct_nbr == base_df.fleet_acct_nbr),
               "left_outer")
         .filter(inc_df.src_sys_id.isNull())
         .drop(base_df.region_cd,
               base_df.fleet_acct_exctv_cd,
               base_df.fleet_acct_nbr,
               base_df.fleet_acct_exctv_nm,
               base_df.fleet_acct_exctv_first_nm,
               base_df.fleet_acct_exctv_last_nm,
               base_df.fleet_acct_exctv_mid_nm,
               base_df.fleet_acct_nm,
               base_df.meas_cnt,
               base_df.dw_anom_flg,
               base_df.dw_mod_ts,
               base_df.src_sys_iud_cd,
               base_df.dw_job_id))

I received the following error:

TypeError: Each column in the param list should be a string

Upvotes: 1

Views: 356

Answers (1)

Dipanjan Mallick

Reputation: 1739

You don't need to drop anything. (The TypeError happens because when drop is given more than one argument, PySpark expects each one to be a string column name, not a Column object like base_df.region_cd.) Instead, you can use select and alias the dataframes as below -

df1.alias("a").join(df2.alias("b"), df1["key"] == df2["key"], 'inner').select("a.*").show()

Replace the values with your own.
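
For reference, a minimal sketch of the same pattern applied to the join from the question (assuming base_df and inc_df as defined there, and keeping the original left join and null filter):

from pyspark.sql import functions as F

# Alias both sides so the duplicated column names can be disambiguated,
# then keep only base_df's columns via "base.*" - nothing to drop afterwards.
uc_df = (base_df.alias("base")
         .join(inc_df.alias("inc"),
               (F.col("inc.src_sys_id") == F.col("base.src_sys_id"))
               & (F.col("inc.fleet_acct_nbr") == F.col("base.fleet_acct_nbr")),
               "left_outer")
         .filter(F.col("inc.src_sys_id").isNull())
         .select("base.*"))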

Upvotes: 1
