Reputation: 4375
For example, I have a DataFrame like below:
df
DataFrame[columnA: int, columnB: int]
I have to do two checks, so I am going over the data twice:
df.where(df.columnA == 412).count()
df.where(df.columnB == 25).count()
In normal code I would have two count variables and increment them on True. How would I do this with a Spark DataFrame? I would appreciate a link to the right documentation as well. Happy to see Python or Scala.
Upvotes: 1
Views: 1149
Reputation: 1665
@zero323's answer is spot on, but to show the most flexible programming model in Spark, you can do your checks as if statements inside a map with a lambda function, e.g. (using the same DataFrame as above):
import org.apache.spark.sql.functions._
// map each row to a pair of 0/1 indicators, then sum each column once
val r1 = df.map(x => {
  var x0 = 0
  var x1 = 0
  if (x(0) == 412) x0 = 1
  if (x(1) == 25) x1 = 1
  (x0, x1)
}).toDF("x0", "x1")

r1.select(sum("x0"), sum("x1")).show()
This model lets you do almost anything you can think of, though you're much better off sticking with the specific APIs where available.
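For comparison, the same single-pass logic can be sketched in plain Python (no Spark involved; the sample rows are the ones from the other answer): map each row to a pair of 0/1 indicators, then sum each position once.

```python
# Sample rows matching the (columnA, columnB) DataFrame above.
rows = [(412, 0), (0, 25), (412, 25), (0, 25)]

# One (0/1, 0/1) indicator pair per row, then a single sum per position.
indicators = [(1 if a == 412 else 0, 1 if b == 25 else 0) for a, b in rows]
count_a = sum(x0 for x0, _ in indicators)
count_b = sum(x1 for _, x1 in indicators)
print(count_a, count_b)  # 2 3
```

This is exactly what the map-then-sum Scala version does, only distributed across the cluster instead of a local list.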
Upvotes: 2
Reputation: 330093
For example like this:
import org.apache.spark.sql.functions.sum
val df = sc.parallelize(Seq(
  (412, 0),
  (0, 25),
  (412, 25),
  (0, 25)
)).toDF("columnA", "columnB")

df.agg(
  sum(($"columnA" === 412).cast("long")).alias("columnA"),
  sum(($"columnB" === 25).cast("long")).alias("columnB")
).show
// +-------+-------+
// |columnA|columnB|
// +-------+-------+
// | 2| 3|
// +-------+-------+
or like this:
import org.apache.spark.sql.functions.{count, when}
df.agg(
  count(when($"columnA" === 412, $"columnA")).alias("columnA"),
  count(when($"columnB" === 25, $"columnB")).alias("columnB")
).show
// +-------+-------+
// |columnA|columnB|
// +-------+-------+
// | 2| 3|
// +-------+-------+
I am not aware of any Spark-specific documentation, but I am pretty sure you'll find this pattern (conditional aggregation) in any good SQL reference.
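To show that this really is plain conditional aggregation, here is the same pattern in standard SQL, sketched with Python's built-in sqlite3 module (the in-memory table and its contents mirror the example DataFrame above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (columnA INTEGER, columnB INTEGER)")
conn.executemany("INSERT INTO df VALUES (?, ?)",
                 [(412, 0), (0, 25), (412, 25), (0, 25)])

# SUM over a boolean expression counts the rows where it is true
# (the analogue of sum(($"columnA" === 412).cast("long"))), while
# COUNT(CASE WHEN ...) mirrors count(when(...)).
row = conn.execute("""
    SELECT SUM(columnA = 412) AS countA,
           COUNT(CASE WHEN columnB = 25 THEN 1 END) AS countB
    FROM df
""").fetchone()
print(row)  # (2, 3)
```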
Upvotes: 2