Reputation: 25
I'm working on a script using PySpark and Spark SQL. After running a LEFT JOIN on my base_df and inc_df dataframes, all of my columns were duplicated. I've figured out why the columns were duplicated, but I'm now receiving a TypeError while trying to drop() those duplicated columns.
I've also tried the suggested solution of joining on the column name, a.join(b, 'id').
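If I understand it correctly, joining on the key names instead of an expression keeps only one copy of the join columns, though the other columns shared by both dataframes still come through twice. A sketch of that mapped onto my two key columns:

# Sketch: joining on a list of key names dedupes the join keys only;
# columns like region_cd that exist on both sides are still duplicated.
uc_df = base_df.join(inc_df, ["src_sys_id", "fleet_acct_nbr"], "left_outer")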
Here's the relevant part of my code:
uc_df = (base_df
         .join(inc_df,
               (inc_df.src_sys_id == base_df.src_sys_id) & (inc_df.fleet_acct_nbr == base_df.fleet_acct_nbr),
               "left_outer")
         .filter(inc_df.src_sys_id.isNull())
         .drop(base_df.region_cd,
               base_df.fleet_acct_exctv_cd,
               base_df.fleet_acct_nbr,
               base_df.fleet_acct_exctv_nm,
               base_df.fleet_acct_exctv_first_nm,
               base_df.fleet_acct_exctv_last_nm,
               base_df.fleet_acct_exctv_mid_nm,
               base_df.fleet_acct_nm,
               base_df.meas_cnt,
               base_df.dw_anom_flg,
               base_df.dw_mod_ts,
               base_df.src_sys_iud_cd,
               base_df.dw_job_id)
         )
I received the following error:
TypeError: Each column in the param list should be a string
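From what I can tell, drop() only takes multiple columns when they are passed as string names; a single Column object is accepted, but several Column objects at once trigger this error. A sketch of a workaround under that assumption, chaining one single-Column drop() call per duplicated column (joined_df here stands for the join/filter result before the failing drop):

joined_df = (base_df
             .join(inc_df,
                   (inc_df.src_sys_id == base_df.src_sys_id)
                   & (inc_df.fleet_acct_nbr == base_df.fleet_acct_nbr),
                   "left_outer")
             .filter(inc_df.src_sys_id.isNull()))

# One Column object per drop() call avoids the strings-only restriction
# that applies when several columns are passed at once:
uc_df = (joined_df
         .drop(base_df.region_cd)
         .drop(base_df.fleet_acct_nbr)
         .drop(base_df.fleet_acct_nm))
# ...and so on for the remaining duplicated base_df columns.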
Upvotes: 1
Views: 356
Reputation: 1739
You don't need to drop anything. Instead, you can alias the dataframes and use select, as below:
df1.alias("a").join(df2.alias("b"), df1["key"] == df2["key"], 'inner').select("a.*").show()
Replace the names and join type with your own.
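Here's a minimal, self-contained sketch of the pattern with made-up dataframes (key and val are placeholder column names):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["key", "val"])
df2 = spark.createDataFrame([(1, "p")], ["key", "val"])

# Alias each side, join on the key, then select only the left side's
# columns -- no duplicates survive, so there is nothing to drop:
result = (df1.alias("a")
          .join(df2.alias("b"), df1["key"] == df2["key"], "inner")
          .select("a.*"))
result.show()

In your case that would be base_df.alias("a") and inc_df.alias("b") with a "left_outer" join, selecting the alias whose columns you want to keep.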
Upvotes: 1