Reputation: 363
I have two dataframes, df1 and df2.
df1 has a column Name with values like a, b, c, etc.
df2 has a column Id with values like a, b.
If the Name column in df1 has a match in the Id column of df2, then the match status should be 0. If there is no match, the match status should be 1.
I know that I can put the df2 Id column in a collection using collect and then check whether the Name column in df1 has a matching entry.
val df1 = Seq("Rey", "John").toDF("Name")
val df2 = Seq("Rey").toDF("Id")
val collect = df2.select("Id").map(r => r.getString(0)).collect.toList
and then do something like:
val df3 = df1.withColumn("match_sts", when(df1("Name").isin(collect: _*), 0).otherwise(1))
Expected output:
+----+---------+
|Name|match_sts|
+----+---------+
| Rey|        0|
|John|        1|
+----+---------+
But I don't want to use collect here. Is there an alternative approach available?
Upvotes: 4
Views: 11400
Reputation: 18013
Using collect is not what you want, but converting a DataFrame column to a list is a well-known pattern. If the list is not huge, you can do the following - it works, and you can also broadcast the in-list:
import spark.implicits._
import org.apache.spark.sql.functions._

val df1 = Seq("Rey", "John", "Donald", "Trump").toDF("Name")
val df2 = Seq("Rey", "Donald").toDF("Id")
val inlist = df2.select("Id").map(r => r.getString(0)).collect.toList
// match -> 0, no match -> 1, as in the expected output
val df3 = df1.withColumn("match_status", when(df1("Name").isin(inlist: _*), 0).otherwise(1))
df3.show(false)
Even the classic examples that filter output against stopwords read from a file do this:
val stopWords = stopWordsInput.flatMap(x => x.split(" ")).map(_.trim).collect.toSet
and broadcast the set to the workers if it is too big. I am not sure how large the data is here, though - 1 lakh is 100,000 entries.
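If the list really were that large, a minimal sketch of the broadcast variant might look like the following (it reuses inlist and df1 from above; bcastIds and matchStatus are just illustrative names):

import org.apache.spark.sql.functions._

// Ship one read-only copy of the set to each executor instead of
// serializing the list into every task closure.
val bcastIds = spark.sparkContext.broadcast(inlist.toSet)

// Membership check against the broadcast set: match -> 0, no match -> 1.
val matchStatus = udf((name: String) => if (bcastIds.value.contains(name)) 0 else 1)

val df3Broadcast = df1.withColumn("match_status", matchStatus(col("Name")))
df3Broadcast.show(false)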
Another approach is to use Spark SQL, relying on Catalyst to optimize the query when EXISTS is used:
import spark.implicits._
import org.apache.spark.sql.functions._
val df1 = Seq("Rey", "John", "Donald", "Trump").toDF("Name")
val df2 = Seq("Rey", "Donald").toDF("Id") // This can be read from file and split etc.
// Optimizer converts to better physical plan for performance in general
df1.createOrReplaceTempView("searchlist")
df2.createOrReplaceTempView("inlist")
val df3 = spark.sql("""SELECT Name, 0 AS match_sts
FROM searchlist A
WHERE EXISTS (SELECT B.Id FROM inlist B WHERE B.Id = A.Name)
UNION
SELECT Name, 1 AS match_sts
FROM searchlist A
WHERE NOT EXISTS (SELECT B.Id FROM inlist B WHERE B.Id = A.Name)
""")
df3.show(false)
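If you want to avoid collect entirely while staying in the DataFrame API, the same result can also be expressed with a join; this is a sketch of an equivalent formulation, not code from the answer above:

import org.apache.spark.sql.functions._

// Left outer join keeps every Name; Id is null where there is no match.
// match -> 0, no match -> 1, as in the expected output.
val df4 = df1
  .join(df2, df1("Name") === df2("Id"), "left_outer")
  .select(df1("Name"), when(df2("Id").isNull, 1).otherwise(0).as("match_sts"))

df4.show(false)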
Upvotes: 2