How to use NOT IN from a CSV file in Spark

Question

I use Spark sql to load data into a val like this

val customers = sqlContext.sql("SELECT * FROM customers")

But I have a separate txt file that contains one column CUST_ID and 50,00 rows. i.e.

CUST_ID
1
2
3

I want my customers val to have all customers in customers table that are not in the TXT file.

Using Sql I would do this by SELECT * FROM customers NOT IN cust_id ('1','2','3')

How can I do this using Spark?

I've read the textFile and I can print rows of it but I'm not sure how to match this with my sql query

scala> val custids = sc.textFile("cust_ids.txt")
scala> custids.take(4).foreach(println)
CUST_ID
1
2
3

ulrich · Accepted Answer

You can import your text file as a dataframe and do a left outer join:

val customers = Seq(("1", "AAA", "shipped"), ("2", "ADA", "delivered") , ("3", "FGA", "never received")).toDF("id","name","status")
val custId = Seq(1,2).toDF("custId")

customers.join(custId,'id === 'custId,"leftOuter")
         .where('custId.isNull)
         .drop("custId")
         .show()


+---+----+--------------+
| id|name|        status|
+---+----+--------------+
|  3| FGA|never received|
+---+----+--------------+

How to use NOT IN from a CSV file in Spark

Answers (1)

Related Questions