Reputation: 4719
A Spark RDD contains two fields, F1 and F2, and is populated by running a SQL query.
F1 must be unique, while F2 does not have that constraint. In effect, there is a one-to-many relationship between F2 and F1: one F2 value can be associated with several F1 values, but not the other way around.
Using Scala, what is the simplest functional programming construct to use against the RDD to check that the data returned from the SQL query does not violate this constraint?
Thanks
Upvotes: 1
Views: 3237
Reputation: 2281
If you are going to work with RDDs (not DataFrames), the approach in the code snippet below can be handy. Let's say your RDD is inputRDD,
with two fields, where the first is used as the key and the second as the value:
inputRDD.countByKey.filter(_._2 > 1)
If there are no duplicates, this returns an empty Map();
otherwise it returns a Map of the duplicated keys (first field) and their counts.
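As a self-contained sketch of this check (the sample data and session setup are illustrative, assuming a local Spark session; note that countByKey collects its result to the driver, so it is only appropriate when the number of distinct keys is small):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("dup-check").getOrCreate()
val sc = spark.sparkContext

// Hypothetical (F1, F2) pairs; F1 should be unique, but "a" appears twice.
val inputRDD = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 2)))

// countByKey returns a Map[F1, Long] on the driver;
// keep only keys seen more than once.
val duplicates = inputRDD.countByKey.filter(_._2 > 1)

// Empty map means the uniqueness constraint holds.
println(duplicates) // here contains the duplicated key "a" with count 2
```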
Upvotes: 1
Reputation: 23109
If this is populated from a SQL query then it must be a DataFrame, and you can validate it simply with
df.select("F1").distinct().count() == df.count()
If you have converted it to an RDD, then you can use, as @pphilantrovert suggested,
rdd.groupBy(_._1).count == rdd.count
Note: these are expensive operations if the dataset is large.
Hope this helps!
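A minimal sketch of the DataFrame check (the column names follow the question; the sample data stands in for the SQL query result):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("unique-check").getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for the SQL query result.
val df = Seq(("a", 1), ("b", 1), ("c", 2)).toDF("F1", "F2")

// F1 is unique iff the number of distinct F1 values equals the row count.
val f1IsUnique = df.select("F1").distinct().count() == df.count()
println(f1IsUnique) // true for this sample: all F1 values are distinct
```

The distinct-count comparison scans the data twice and shuffles once for the distinct, which is why it becomes expensive on large datasets.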
Upvotes: 4