Gabber

Reputation: 5452

Drop a list of columns from a single DataFrame in Spark

I have a DataFrame, df3, that results from joining two DataFrames, df1 and df2. Every column in df2 also exists in df1, but the contents differ. I'd like to remove from the join result all the df1 columns whose names appear in df2.columns. Is there a way to do this without using a var? Currently I do this:

var ret = df3
df2.columns.foreach(coln => ret = ret.drop(df1(coln)))

but what I really want is just a shortcut for

df3.drop(df1(df2.columns(0))).drop(df1(df2.columns(1)))....

without using a var.

Passing a list of columns to drop is not an option. I don't know whether that is because I'm using Spark 2.2.

EDIT:

Important note: I don't know the columns of df1 and df2 in advance.

Upvotes: 0

Views: 382

Answers (2)

Raphael Roth

Reputation: 27373

A shortcut that avoids the var is to fold over the column names, dropping df1's copy of each one:

val ret = df2.columns.foldLeft(df3)((acc, coln) => acc.drop(df1(coln)))

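I would suggest removing the columns before the join. Alternatively, select only the columns of df3 that come from df2. Reference them through df2, because a bare col(name) is ambiguous once both sides of the join carry the same column names:

val ret = df3.select(df2.columns.map(df2(_)): _*)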
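To make the fold concrete, here is a minimal end-to-end sketch. The sample data, the column names (id, name, score), the helper name dropShared, and the local SparkSession are all assumptions made for illustration. It uses an inner join and drops df1's copies, and it relies on every df2 column also existing in df1, as the question states.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropSharedColumns {
  // Hypothetical helper: removes from `joined` each column of `left`
  // whose name also occurs in `right`. Assumes every column name in
  // `right` exists in `left`; otherwise left(coln) throws.
  def dropShared(joined: DataFrame, left: DataFrame, right: DataFrame): DataFrame =
    right.columns.foldLeft(joined)((acc, coln) => acc.drop(left(coln)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("drop-shared-columns")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: df2's columns are a subset of df1's.
    val df1 = Seq((1, "a1", 10), (2, "b1", 20)).toDF("id", "name", "score")
    val df2 = Seq((1, "a2"), (2, "b2")).toDF("id", "name")
    val df3 = df1.join(df2, df1("id") === df2("id"))

    // No var needed: the fold threads the intermediate DataFrame through.
    val ret = dropShared(df3, df1, df2)
    ret.printSchema()

    spark.stop()
  }
}
```

The fold does exactly what the var and foreach version does, but the accumulator takes the place of the mutable variable.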

Upvotes: 1

Subhasish Guha

Reputation: 232

This can be done as part of the join itself. Try the code below:

val resultDf = df1.alias("frstdf")
  .join(broadcast(df2).alias("scndf"), $"frstdf.col1" === $"scndf.col1", "left_outer")
  .selectExpr("scndf.col1", "scndf.col2"...) // or: .selectExpr("scndf.*")

The result contains only the columns from the second DataFrame. Hope this helps.
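Since the column names are not known in advance, the select list can also be built from the schemas at runtime. The sketch below is one possible way to do that, not a definitive implementation: the helper name joinPreferRight, the sample data, and the join key col1 are assumptions. It keeps all of df2's columns plus any df1 columns that df2 does not have.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object SelectDuringJoin {
  // Hypothetical helper: joins on `key` and keeps df2's version of every
  // shared column, plus the columns that exist only in df1.
  def joinPreferRight(df1: DataFrame, df2: DataFrame, key: String): DataFrame = {
    val shared = df2.columns.toSet
    // Backticks guard against column names containing special characters.
    val onlyInDf1 = df1.columns.filterNot(shared).map(c => s"frstdf.`$c`")
    val exprs = "scndf.*" +: onlyInDf1.toSeq

    df1.alias("frstdf")
      .join(broadcast(df2).alias("scndf"),
        col(s"frstdf.$key") === col(s"scndf.$key"), "left_outer")
      .selectExpr(exprs: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("select-during-join")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data for illustration only.
    val df1 = Seq((1, "a1", 10), (2, "b1", 20)).toDF("col1", "col2", "score")
    val df2 = Seq((1, "a2"), (2, "b2")).toDF("col1", "col2")

    joinPreferRight(df1, df2, "col1").show()
    spark.stop()
  }
}
```

Because the join is a left outer join, df1 rows without a match in df2 will show nulls in the scndf columns, including the key.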

Upvotes: 3
