Reputation: 323
I have a Dataset whose "features" column I bucket with the org.apache.spark.ml.feature.QuantileDiscretizer class in Spark 2.3.1, but the resulting buckets are not uniform.
The last bucket holds almost twice as many rows as each of the others, and although I set numBuckets to 11, only 10 buckets are actually produced.
Here is the program:
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.ml.feature.Bucketizer
val model = new QuantileDiscretizer()
.setInputCol("features")
.setOutputCol("level")
.setNumBuckets(11)
.setHandleInvalid("keep")
.fit(df)
println(model.getSplits.mkString(", "))
model
.transform(df)
.groupBy("level")
.count
.orderBy("level")
.show
The output:
-Infinity, 115.0, 280.25, 479.75, 712.5, 1000.0, 1383.37, 1892.75, 2690.93, 4305.0, Infinity
+-----+------+
|level| count|
+-----+------+
| null| 9113|
| 0.0| 55477|
| 1.0| 52725|
| 2.0| 54657|
| 3.0| 53592|
| 4.0| 54165|
| 5.0| 54732|
| 6.0| 52915|
| 7.0| 54090|
| 8.0| 53393|
| 9.0|107369|
+-----+------+
Bucketing only the rows that fall into the last bucket splits them evenly:
val df1 = df.where("features >= 4305.0")
val model1 = new QuantileDiscretizer()
.setInputCol("features")
.setOutputCol("level")
.setNumBuckets(2)
.setHandleInvalid("keep")
.fit(df1)
println(model1.getSplits.mkString(", "))
model1
.transform(df1)
.groupBy("level")
.count
.orderBy("level")
.show
The output:
-Infinity, 20546.12, Infinity
+-----+-----+
|level|count|
+-----+-----+
| 0.0|53832|
| 1.0|53537|
+-----+-----+
If I specify the bucket boundaries manually with a Bucketizer, adding the extra split point found above:
val splits = Array(Double.NegativeInfinity,
115.0, 280.25, 479.75, 712.5, 1000.0, 1383.37, 1892.75, 2690.93, 4305.0,
20546.12, Double.PositiveInfinity)
val model = new Bucketizer()
.setInputCol("features")
.setOutputCol("level")
.setHandleInvalid("keep")
.setSplits(splits)
model
.transform(df)
.groupBy("level")
.count
.orderBy("level")
.show
The output:
+-----+-----+
|level|count|
+-----+-----+
| null| 9113|
| 0.0|55477|
| 1.0|52725|
| 2.0|54657|
| 3.0|53592|
| 4.0|54165|
| 5.0|54732|
| 6.0|52915|
| 7.0|54090|
| 8.0|53393|
| 9.0|53832|
| 10.0|53537|
+-----+-----+
Why does QuantileDiscretizer behave like this?
How can I split the raw data into evenly sized buckets?
Upvotes: 3
Views: 1360
Reputation: 11
Set the relative error to a small number, e.g.:
from pyspark.ml.feature import QuantileDiscretizer

qds = QuantileDiscretizer(
    numBuckets=10,
    inputCol="score_rand",
    outputCol="buckets",
    relativeError=0.0001,
    handleInvalid="error")
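If the buckets are still not nearly even after that, I believe the cause is ties in the column being bucketed: when several quantile boundaries land on the same value, they collapse into one and you get fewer buckets than requested. In that case, try adding a small random number, larger than the relative error, to each value; you should then get the desired number of buckets.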
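To illustrate the effect of ties, here is a minimal pure-Python sketch. It is not Spark's implementation: it computes exact quantile cut points, drops duplicate boundaries (the Spark documentation notes that fewer buckets than numBuckets may result when there are too few distinct quantiles), and then repeats the computation after adding a small random jitter. The helper names `quantile_splits` and `bucket_counts` and the sample data are hypothetical, chosen only for this example.

```python
import bisect
import random
from collections import Counter

def quantile_splits(values, num_buckets):
    """Exact quantile cut points with duplicates removed (illustration only)."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]
    # Equal cut points collapse into one, so fewer buckets come out.
    inner = sorted(set(cuts))
    return [float("-inf")] + inner + [float("inf")]

def bucket_counts(values, splits):
    """Count values per bucket; bucket i covers [splits[i], splits[i + 1])."""
    counts = Counter(bisect.bisect_right(splits, v) - 1 for v in values)
    return [counts.get(i, 0) for i in range(len(splits) - 1)]

# Hypothetical data: 60 distinct values plus 40 rows tied at 100.0.
data = [float(i) for i in range(60)] + [100.0] * 40

tied_splits = quantile_splits(data, 5)
tied_counts = bucket_counts(data, tied_splits)

# Jitter smaller than the gap between distinct values keeps their order
# but breaks the ties, so every requested cut point becomes distinct.
random.seed(0)
jittered = [v + random.uniform(0, 1e-3) for v in data]
jit_splits = quantile_splits(jittered, 5)
jit_counts = bucket_counts(jittered, jit_splits)

print(tied_counts)
print(jit_counts)
```

With the tied data, two of the four requested cut points coincide at 100.0, so only four buckets come out and the last one is twice as large as the others, which resembles the pattern in the question. After jittering, all five buckets are produced. In Spark the jitter would be added to the column before fitting, and since approximate quantiles are involved the bucket sizes may still differ slightly.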
Upvotes: 0