In Scala/Spark code I have a DataFrame which contains some rows:
col1  col2
abc   someValue1
xyz   someValue2
lmn   someValue3
zmn   someValue4
pqr   someValue5
cda   someValue6
And I have an ArrayBuffer[String] variable which contains [xyz, pqr, abc].
I want to filter the given DataFrame where col1 matches any of the values in the ArrayBuffer.
In SQL it would be like:
select * from tableXyz where col1 in("xyz","pqr","abc");
Assuming you have your DataFrame:
import sqlContext.implicits._  // needed for .toDF on an RDD

val df = sc.parallelize(Seq(
  ("abc", "someValue1"),
  ("xyz", "someValue2"),
  ("lmn", "someValue3"),
  ("zmn", "someValue4"),
  ("pqr", "someValue5"),
  ("cda", "someValue6")))
  .toDF("col1", "col2")
+----+----------+
|col1| col2|
+----+----------+
| abc|someValue1|
| xyz|someValue2|
| lmn|someValue3|
| zmn|someValue4|
| pqr|someValue5|
| cda|someValue6|
+----+----------+
Then you can define a UDF to filter the DataFrame based on the array's values:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions.{col, udf}

val array = ArrayBuffer[String]("xyz", "pqr", "abc")
// Wrap the membership test in a UDF so it can be applied column-wise
val function: (String => Boolean) = (arg: String) => array.contains(arg)
val udfFiltering = udf(function)
val filtered = df.filter(udfFiltering(col("col1")))
filtered.show()
+----+----------+
|col1| col2|
+----+----------+
| abc|someValue1|
| xyz|someValue2|
| pqr|someValue5|
+----+----------+
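As a side note, Spark 1.5+ also provides a built-in isin method on Column, which avoids the UDF entirely; a minimal sketch, assuming the same df and array as above (filteredIsin is just an illustrative name):
// isin takes varargs, so splat the ArrayBuffer into the call
val filteredIsin = df.filter(col("col1").isin(array: _*))
filteredIsin.show()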
Alternatively, you can register your DataFrame as a temporary table and query it via SQLContext:
// Build the IN clause by quoting each element and joining with commas
val elements = array.map(el => "\"" + el + "\"").mkString(",")
val query = "select * from tableXyz where col1 in(" + elements + ")"
df.registerTempTable("tableXyz")
val filteredSql = sqlContext.sql(query)
filteredSql.show()
+----+----------+
|col1| col2|
+----+----------+
| abc|someValue1|
| xyz|someValue2|
| pqr|someValue5|
+----+----------+
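Note that registerTempTable and SQLContext are deprecated as of Spark 2.x in favor of SparkSession; a sketch of the equivalent, assuming a SparkSession named spark is in scope (filtered2 is an illustrative name):
// Spark 2.x+: register a temp view and query it through the session
df.createOrReplaceTempView("tableXyz")
val filtered2 = spark.sql(query)
filtered2.show()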