xuejianbest

Reputation: 323

Why are the results obtained by using Spark's QuantileDiscretizer grouped unevenly?

I have a Dataset.

I bucket a feature column with the org.apache.spark.ml.feature.QuantileDiscretizer class of Spark 2.3.1, but the buckets produced by the resulting model are not uniform.

The last bucket contains almost twice as many rows as the other buckets, and although I requested 11 buckets in the parameters, only 10 buckets are actually produced.

Look at the program below.

import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.ml.feature.Bucketizer
val model = new QuantileDiscretizer()
    .setInputCol("features")
    .setOutputCol("level")
    .setNumBuckets(11)
    .setHandleInvalid("keep")
    .fit(df)
println(model.getSplits.mkString(", "))
model
    .transform(df)
    .groupBy("level")
    .count
    .orderBy("level")
    .show

The output:

-Infinity, 115.0, 280.25, 479.75, 712.5, 1000.0, 1383.37, 1892.75, 2690.93, 4305.0, Infinity
+-----+------+                                                                  
|level| count|
+-----+------+
| null|  9113|
|  0.0| 55477|
|  1.0| 52725|
|  2.0| 54657|
|  3.0| 53592|
|  4.0| 54165|
|  5.0| 54732|
|  6.0| 52915|
|  7.0| 54090|
|  8.0| 53393|
|  9.0|107369|
+-----+------+

Bucketing the last group of the data separately:

val df1 = df.where("features >= 4305.0")
val model1 = new QuantileDiscretizer()
    .setInputCol("features")
    .setOutputCol("level")
    .setNumBuckets(2)
    .setHandleInvalid("keep")
    .fit(df1)

println(model1.getSplits.mkString(", "))
model1
    .transform(df1)
    .groupBy("level")
    .count
    .orderBy("level")
    .show

The output:

-Infinity, 20546.12, Infinity
+-----+-----+                                                                   
|level|count|
+-----+-----+
|  0.0|53832|
|  1.0|53537|
+-----+-----+

If I manually specify the split boundaries and use a Bucketizer:

val splits = Array(Double.NegativeInfinity, 
    115.0, 280.25, 479.75, 712.5, 1000.0, 1383.37, 1892.75, 2690.93, 4305.0, 
    20546.12, Double.PositiveInfinity)
val model = new Bucketizer()
    .setInputCol("features")
    .setOutputCol("level")
    .setHandleInvalid("keep")
    .setSplits(splits)
model
    .transform(df)
    .groupBy("level")
    .count
    .orderBy("level")
    .show

The output:

+-----+-----+                                                                   
|level|count|
+-----+-----+
| null| 9113|
|  0.0|55477|
|  1.0|52725|
|  2.0|54657|
|  3.0|53592|
|  4.0|54165|
|  5.0|54732|
|  6.0|52915|
|  7.0|54090|
|  8.0|53393|
|  9.0|53832|
| 10.0|53537|
+-----+-----+

Why does QuantileDiscretizer behave like this?

And how can I bucket the raw data evenly?

Upvotes: 3

Views: 1360

Answers (1)

hjerp

Reputation: 11

Set the relative error to a small number, e.g.,

from pyspark.ml.feature import QuantileDiscretizer

qds = QuantileDiscretizer(
    numBuckets=10,
    inputCol="score_rand",
    outputCol="buckets",
    relativeError=0.0001,
    handleInvalid="error")
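
For the Scala API used in the question, a minimal sketch of the same fix (reusing the df and column names from the question) would be:

import org.apache.spark.ml.feature.QuantileDiscretizer

// A smaller relativeError (default 0.001) makes the approximate quantile
// computation more accurate, so the splits land closer to the true quantiles.
val model = new QuantileDiscretizer()
    .setInputCol("features")
    .setOutputCol("level")
    .setNumBuckets(11)
    .setRelativeError(0.0001)
    .setHandleInvalid("keep")
    .fit(df)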

If you still do not obtain nearly even groups, it is probably because there are ties in the column being bucketed. In that case, try adding a small random number, larger than the relative error, to the values, and you should get the desired number of buckets.
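
A minimal sketch of that tie-breaking idea in Scala (the jittered column name and the jitter scale are illustrative, not from the answer):

import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.functions.{col, rand}

// Add a tiny random jitter to break ties, then discretize the jittered
// column instead of the original "features" column.
val dfJittered = df.withColumn("featuresJittered", col("features") + rand() * 0.001)

val jitteredModel = new QuantileDiscretizer()
    .setInputCol("featuresJittered")
    .setOutputCol("level")
    .setNumBuckets(11)
    .setRelativeError(0.0001)
    .setHandleInvalid("keep")
    .fit(dfJittered)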

Upvotes: 0
