Reputation: 3256
I am very new to Spark and Scala, and I want to test whether a value is a key in an RDD.
The data I have is like this:
RDD data: key -> value
RDD stat: key -> statistics
What I want to do is to filter all the key-value pairs in data whose key appears in stat.
My general idea is to convert the keys of that RDD into a set, then test whether each key belongs to this set.
Are there better approaches, and how do I convert the keys of an RDD into a set using Scala?
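For concreteness, here is a minimal sketch of the setup described above; the keys and values are made up, and a local SparkContext is created the way spark-shell would provide one:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("example"))

// Hypothetical sample data matching the shapes described above
val data = sc.parallelize(Seq(("a", "x"), ("b", "y"), ("c", "z"))) // key -> value
val stat = sc.parallelize(Seq(("a", 1.0), ("b", 2.0)))             // key -> statistics

// One way to turn an RDD's keys into a Set on the driver:
val keySet: Set[String] = stat.keys.collect().toSet
```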
Thanks.
Upvotes: 1
Views: 3144
Reputation: 11751
You can use lookup:
def lookup(key: K): List[V]
Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.
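A minimal sketch of lookup on an RDD shaped like stat, assuming sc is an existing SparkContext (e.g. from spark-shell) and made-up sample values:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("lookup-example"))

// stat: key -> statistics, with a repeated key to show lookup returns all values
val stat = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))

// lookup returns every value stored under the given key
val valuesForA = stat.lookup("a")
```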
You asked:

What I want to do is to filter all the key-value pairs in data whose key appears in stat.

I think you should join by key instead of doing a lookup.
join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
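As a sketch of that approach, with made-up sample data and sc assumed to be an existing SparkContext (spark-shell style): an inner join keeps exactly the keys present in both RDDs, which is the filter the question asks for.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("join-example"))

// data: key -> value, stat: key -> statistics (names from the question)
val data = sc.parallelize(Seq(("a", "x"), ("b", "y"), ("c", "z")))
val stat = sc.parallelize(Seq(("a", 1.0), ("b", 2.0)))

// join yields RDD[(K, (V, W))], containing only keys present in both RDDs
val joined = data.join(stat)

// drop the statistics to get back to key -> value
val filtered = joined.mapValues { case (v, _) => v }
val result = filtered.collect().toSet
```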
Whatever you do, avoid trying to "close over an RDD inside another RDD", i.e. using an RDD inside the transformations (in this case filter) of another RDD. Nesting one RDD inside another is not allowed in Spark.
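If you still prefer the set-membership idea from the question, the usual pattern is to collect the (presumably small) key set of stat to the driver and broadcast it, then filter with a plain Set test. A sketch with made-up data, assuming stat's keys fit in driver memory and sc is an existing SparkContext:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("broadcast-example"))

val data = sc.parallelize(Seq(("a", "x"), ("b", "y"), ("c", "z")))
val stat = sc.parallelize(Seq(("a", 1.0), ("b", 2.0)))

// Collect stat's keys once, broadcast them to the executors,
// and filter data with an ordinary Set membership test.
val statKeys = sc.broadcast(stat.keys.collect().toSet)
val filtered = data.filter { case (k, _) => statKeys.value.contains(k) }
val result = filtered.collect().toSet
```

This avoids the shuffle a join incurs, at the cost of materializing stat's keys on the driver.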
Upvotes: 2