patrickbarker

Reputation: 393

Spark Deduplicate column in dataframe based on column in other dataframe

I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe's column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until Spark 2.0. I suppose I could try joining the tables, but that seems a bit messy. Here is the general idea:

df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]

df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]

Basically, I want to remove from df any rows whose UNIQUE_ID also appears in df_re, and then return the whole dataframe with those duplicate rows dropped. I'm sure I could iterate over each one, but I am wondering if there is a better way. Any ideas?

Upvotes: 4

Views: 494

Answers (1)

David Griffin

Reputation: 13927

The way to do this is a left_outer join, and then a filter for rows where the right-hand side of the join is null. Something like:

// In spark-shell the toDF implicits are already in scope; in an application,
// import them from your SQLContext / SparkSession (e.g. import spark.implicits._)
import org.apache.spark.sql.functions.col

val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")

// left_outer keeps every row of df1; rows with no match in df2 come back
// with nulls on the df2 side, so filtering on isNull keeps exactly the
// rows whose uniq_id does not appear in df2
df1.as("df1").join(
  df2.as("df2"),
  col("df1.uniq_id") === col("df2.uniq_id"),
  "left_outer"
).filter($"df2.uniq_id".isNull)
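For reference, here is a minimal PySpark sketch of the same left_outer-plus-isNull pattern applied to dataframes shaped like the question's df and df_re. The sample rows and column names are stand-ins (spaces in the original column names are replaced with underscores here); the join key is assumed to be UNIQUE_ID.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# toy stand-ins for df and df_re from the question (made-up sample rows)
df = spark.createDataFrame(
    [("20160531 23:03:33", 233347, "my_roi", 1, "173888")],
    ["TIMESTAMP", "CLIENT_ID", "ROI_NAME", "ROI_VALUE", "UNIQUE_ID"],
)
df_re = spark.createDataFrame([("173888",)], ["UNIQUE_ID"])

# left_outer join, keep only the rows where df_re found no match,
# then select just df's columns
deduped = (
    df.alias("a")
    .join(df_re.alias("b"), col("a.UNIQUE_ID") == col("b.UNIQUE_ID"), "left_outer")
    .filter(col("b.UNIQUE_ID").isNull())
    .select("a.*")
)
deduped.show()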

Upvotes: 6
