user1052610

Reputation: 4719

Checking that a field in an RDD contains unique values

A Spark RDD contains two fields, F1 and F2, and is populated by running a SQL query.

F1 must be unique, while F2 does not have that constraint. In effect, there is a one-to-many relationship between F2 and F1: one F2 value can be associated with several F1 values, but not the other way around.

Using Scala, what is the simplest functional programming construct to apply to the RDD to check that the data returned from the SQL query does not violate this constraint?

Thanks

Upvotes: 1

Views: 3237

Answers (2)

FaigB

Reputation: 2281

If you are working with RDDs (not DataFrames), the snippet below can be handy. Say your RDD is inputRDD with two fields, the first used as the key and the second as the value:

inputRDD.countByKey.filter(_._2 > 1)

If there is no duplication, it returns an empty Map(); otherwise, it returns a Map of the duplicated keys (the first field) and their counts.
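A minimal runnable sketch of the check above, assuming a SparkContext `sc` is available; the sample data and variable names are made up for illustration:

```scala
import org.apache.spark.SparkContext

// Assumes `sc: SparkContext` is in scope (e.g. from spark-shell).
// Hypothetical sample data: (F1, F2) pairs, with F1 as the key.
// Note "a" appears twice, deliberately violating the constraint.
val inputRDD = sc.parallelize(Seq(
  ("a", "x"), ("b", "x"), ("c", "y"), ("a", "z")
))

// countByKey collects a local Map[K, Long] of per-key counts;
// keep only the keys that occur more than once.
val duplicates = inputRDD.countByKey.filter(_._2 > 1)

if (duplicates.isEmpty) println("F1 is unique")
else println(s"Duplicated F1 values: $duplicates")
```

Note that countByKey materializes the counts on the driver, which is fine for checking a modest number of distinct keys but can be a problem if F1 has very high cardinality.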

Upvotes: 1

koiralo

Reputation: 23109

If this is populated from a SQL query, then it must be a DataFrame, and you can simply validate it with:

df.select("F1").distinct().count() == df.count()
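A self-contained sketch of this DataFrame check, assuming the column holding F1 is literally named "F1" (substitute your own column name) and building a SparkSession locally for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder
  .appName("uniqueness-check")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up sample data with unique F1 values.
val df = Seq(("a", "x"), ("b", "x"), ("c", "y")).toDF("F1", "F2")

// F1 is unique iff the number of distinct F1 values equals the row count.
val isUnique = df.select("F1").distinct().count() == df.count()
println(s"F1 unique: $isUnique")
```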

If you have converted it to an RDD, then you can use, as @pphilantrovert suggested:

rdd.groupBy(_._1).count == rdd.count

Note: this is an expensive operation if the dataset is large.

Hope this helps!

Upvotes: 4
