Reputation: 365
Suppose I have 2 RDDs, where RDD1 has (key1, key2, value) and RDD2 has (key1, value). Now I want to apply an operation (like + or -) from RDD2 to RDD1 wherever key1 matches. Here is an example:
RDD1 has [(1, 1, 3), (1, 2, 2), (2, 2, 5)]
RDD2 = sc.parallelize([(1, 1)])
I want the resulting RDD3 to be [(1, 1, 4), (1, 2, 3), (2, 2, 5)]: only the first and second rows had values added, while the third one didn't.
I tried to use a left outer join to find matches on key1 and apply the operation, but then I lose the rows that don't need the operation. Is there a way to apply an operation to only part of the data?
Upvotes: 0
Views: 474
Reputation: 330083
Assuming you want pairwise operations or your data contains 1-to-0..1 relationships, the simplest thing you can do is to convert both RDDs to DataFrames:
from pyspark.sql.functions import coalesce, lit

df1 = sc.parallelize([
    (1, 1, 3), (1, 2, 2), (2, 2, 5)
]).toDF(("key1", "key2", "value"))

df2 = sc.parallelize([(1, 1)]).toDF(("key1", "value"))

new_value = (
    df1["value"] +                    # Old value
    coalesce(df2["value"], lit(0))    # If no match (NULL) take 0
).alias("value")                      # Set alias

df1.join(df2, ["key1"], "leftouter").select("key1", "key2", new_value)
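For the sample data this yields (1, 1, 4), (1, 2, 3) and (2, 2, 5) (in no particular order); rows with no match in df2 keep their original value. If you would rather stay with plain RDDs, the same pattern works with leftOuterJoin, defaulting the missing right side to 0. A minimal sketch, assuming the tuple layout from the question:

rdd1 = sc.parallelize([(1, 1, 3), (1, 2, 2), (2, 2, 5)])
rdd2 = sc.parallelize([(1, 1)])

result = (rdd1
    .map(lambda r: (r[0], (r[1], r[2])))   # key by key1: (key1, (key2, value))
    .leftOuterJoin(rdd2)                   # unmatched keys get None on the right
    .map(lambda kv: (kv[0],                # rebuild (key1, key2, value)
                     kv[1][0][0],
                     kv[1][0][1] + (kv[1][1] if kv[1][1] is not None else 0))))

result.collect()  # the three rows above, order not guaranteed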
You can easily adjust this to handle other scenarios by applying an aggregation on df2 before joining the DataFrames.
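For instance, assuming RDD2 may contain several rows per key1, you could sum them first so the join stays 1-to-0..1 (df2_agg is just an illustrative name):

from pyspark.sql.functions import sum as sum_

df2_agg = df2.groupBy("key1").agg(sum_("value").alias("value"))

df1.join(df2_agg, ["key1"], "leftouter").select(
    "key1", "key2",
    (df1["value"] + coalesce(df2_agg["value"], lit(0))).alias("value")
)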
Upvotes: 1