Reputation: 43
I'm using Spark (Scala) to QA data movement - moving tables from one relational database to another. The QA process involves executing a full outer join between the source table, and the target table.
The source table and target tables are joined in a data frame on the key(s):
val joinColumns = for (i <- 0 to (sourceJoinFields.length - 1)) yield sourceDF.col(sourceJoinFields(i)) <=> targetDF.col(targetJoinFields(i))
val joinedDF = sourceDF.join(targetDF, joinColumns.reduce((_&&_)), "fullouter")
I'm using the following logic to find mismatches:
val mismatchColumns = for (i <- 0 to (sourceDF.columns.length-1)) yield (joinedDF.col(joinedDF.columns(i)) =!= joinedDF.col(joinedDF.columns(i+(sourceDF.columns.length))))
val mismatchedDF = joinedDF.filter(mismatchColumns.reduce((_||_)))
However if there is a key missing from one side of the full outer join:
+--------------+--------------+--------------+--------------+
|source_key |source_field |target_key |target_field |
+--------------+--------------+--------------+--------------+
|null |null |XXX |XXX |
will not be in the mismatchedDF data set.
So my question: is the =!=
operator the opposite of the <=>
operator? It does not appear to be, so is there an operator that will return FALSE for this case? I can't find much documentation on either operator.
Upvotes: 4
Views: 5959
Reputation: 330083
The opposite of IS NOT DISTINCT FROM
(<=>
) is IS DISTINCT FROM
(not(... <=> ...)
).
import org.apache.spark.sql.not
val df = Seq(("foo", null), ("foo", "bar"), ("foo", "foo")).toDF("x", "y")
df.select(not($"x" <=> $"y"))
or
df.select(!($"x" <=> $"y"))
or
df.selectExpr("x IS DISTINCT FROM y")
all giving the same result:
+---------------+
|(NOT (x <=> y))|
+---------------+
| true|
| true|
| false|
+---------------+
Of course if you have a disjunction of negations:
(NOT P) OR (NOT Q)
you can always use De Morgan's laws to rewrite it as a negation of concjunctions
NOT(P AND Q)
therefore:
not(joinColumns.foldLeft(lit(true))(_ and _))
should work just fine.
Upvotes: 6