user1052610

Reputation: 4719

Checking that a field in an RDD contains unique values

A Spark RDD contains two fields, F1 and F2, and is populated by running a SQL query.

F1 must be unique, while F2 does not have that constraint. In effect, there is a one-to-many relationship between F2 and F1: one F2 value can be associated with several F1 values, but not the other way around.

Using Scala, what is the simplest functional programming construct to apply to the RDD to check that the data returned from the SQL query does not violate this constraint?

Thanks

Upvotes: 1

Views: 3237

Answers (2)

FaigB

Reputation: 2281

If you are working with RDDs (not DataFrames), the snippet below can be handy. Say your RDD is inputRDD with two fields, the first used as the key and the second as the value:

inputRDD.countByKey.filter(_._2 > 1)

If there is no duplication, it returns an empty Map(); otherwise, it returns a Map of the duplicated keys (the first field) and their counts.
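A minimal runnable sketch of the check above, assuming a SparkContext `sc` is available; the sample data and variable names are made up for illustration:

```scala
import org.apache.spark.SparkContext

// Assumes `sc: SparkContext` is in scope (e.g. from spark-shell).
// Hypothetical sample data: (F1, F2) pairs, with F1 as the key.
// Note "a" appears twice, deliberately violating the constraint.
val inputRDD = sc.parallelize(Seq(
  ("a", "x"), ("b", "x"), ("c", "y"), ("a", "z")
))

// countByKey collects a local Map[K, Long] of per-key counts;
// keep only the keys that occur more than once.
val duplicates = inputRDD.countByKey.filter(_._2 > 1)

if (duplicates.isEmpty) println("F1 is unique")
else println(s"Duplicated F1 values: $duplicates")
```

Note that countByKey materializes the counts on the driver, which is fine for checking a modest number of distinct keys but can be a problem if F1 has very high cardinality.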

Upvotes: 1

koiralo

Reputation: 23109

If this is populated from a SQL query, then it must be a DataFrame, and you can simply validate it with:

df.select("F1").distinct().count() == df.count()
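A self-contained sketch of this DataFrame check, assuming the column holding F1 is literally named "F1" (substitute your own column name) and building a SparkSession locally for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder
  .appName("uniqueness-check")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up sample data with unique F1 values.
val df = Seq(("a", "x"), ("b", "x"), ("c", "y")).toDF("F1", "F2")

// F1 is unique iff the number of distinct F1 values equals the row count.
val isUnique = df.select("F1").distinct().count() == df.count()
println(s"F1 unique: $isUnique")
```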

If you have converted it to an RDD, then you can use, as @pphilantrovert suggested:

rdd.groupBy(_._1).count == rdd.count

Note: this is an expensive operation if the dataset is large.

Hope this helps!

Upvotes: 4
