Reputation: 340
I am trying to filter a DataFrame based on a list of values, and I can run it the way it is written in example 1. However, when I put the elements into a List and pass that List to the isin function inside filter, it does not work (shown in example 2).
val df1 = sc.parallelize(Seq((1,"abcd"), (2,"defg"), (3, "ghij"),(4,"xyzz"),(5,"lmnop"),(6,"pqrst"),(7,"wxyz"),(8,"lmnoa"),(9,"jklm"))).toDF("c1","c2")
//example 1:
val df2 = df1.filter(substring(col("c2"), 0, 3).isin("abc","def","ghi"))
//example 2:
val given_list = List("abc","def","ghi")
val df3 = df1.filter(substring(col("c2"), 0, 3).isin(given_list))
The error message when running example 2 is shown below:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon List(abc, def, ghi)
19/10/22 17:03:10 INFO spark.SparkContext: Invoking stop() from shutdown hook
19/10/22 17:03:10 INFO server.AbstractConnector: Stopped Spark@5817c15f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
19/10/22 17:03:10 INFO ui.SparkUI: Stopped Spark web UI at http://192.---.---.---:----
19/10/22 17:03:10 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/10/22 17:03:10 INFO memory.MemoryStore: MemoryStore cleared
19/10/22 17:03:10 INFO storage.BlockManager: BlockManager stopped
19/10/22 17:03:10 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
19/10/22 17:03:10 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/10/22 17:03:10 INFO spark.SparkContext: Successfully stopped SparkContext
19/10/22 17:03:10 INFO util.ShutdownHookManager: Shutdown hook called
19/10/22 17:03:10 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-d17c100d-3e95-4016-a2fd-4e1e02b2449f
Thanks in advance.
Upvotes: 0
Views: 3066
Reputation: 22439
Method isin takes an Any* varargs parameter rather than a collection like List. You can use the "splat" operator (i.e. _*) as shown below:
df1.filter(substring(col("c2"), 0, 3).isin(given_list: _*))
Spark 2.4+ does provide method isInCollection, which takes an Iterable collection and can be used as follows:
df1.filter(substring(col("c2"), 0, 3).isInCollection(given_list))
Upvotes: 4