Georg Heiler

Reputation: 17704

How to aggregate data into ranges (bucketize)?

I have a table like

+---------------+------+
|             id| value|
+---------------+------+
|              1| 118.0|
|              2| 109.0|
|              3| 113.0|
|              4|  82.0|
|              5|  60.0|
|              6| 111.0|
|              7| 107.0|
|              8|  84.0|
|              9|  91.0|
|             10| 118.0|
+---------------+------+

and would like to aggregate or bin the values into the ranges 0,10,20,30,40,...,80,90,100,110,120. How can I perform this in SQL, or more specifically Spark SQL?

Currently I have a lateral view join against the range, but this seems rather clumsy and inefficient.
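
For reference, a rough sketch of what that looks like (the table name and the hard-coded bucket array are placeholders):

spark.sql("""
  SELECT t.id, t.value, b.bucket
  FROM table_name t
  LATERAL VIEW explode(array(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110)) b AS bucket
  WHERE t.value >= b.bucket AND t.value < b.bucket + 10
""")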

The QuantileDiscretizer is not really what I want; rather, I need a cut with this fixed range.

edit

https://github.com/collectivemedia/spark-ext/blob/master/sparkext-mllib/src/main/scala/org/apache/spark/ml/feature/Binning.scala performs dynamic binning, but I need this fixed, specified range instead.

Upvotes: 3

Views: 8019

Answers (2)

Bertram Gilfoyle

Reputation: 10235

Try "GROUP BY" with this

SELECT id, (value DIV 10) * 10 AS bucket FROM table_name;
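
The query above only computes the bucket for each row; to actually aggregate per range, a grouped version could look like this sketch (the aliases are my own, reusing the answer's DIV expression):

spark.sql("""
  SELECT (value DIV 10) * 10 AS bucket, COUNT(id) AS cnt
  FROM table_name
  GROUP BY (value DIV 10) * 10
""").show()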

The following uses the Scala Dataset API:

import spark.implicits._  // enables the $"..." column syntax
df.select(($"value" / 10).cast("int") * 10 as "bucket")
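
The same aggregation in the Dataset API might look like this (a sketch; the column alias is my choice):

df.select(($"value" / 10).cast("int") * 10 as "bucket")
  .groupBy($"bucket")
  .count()
  .show()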

Upvotes: 3

Hristo Iliev

Reputation: 74455

In the general case, static binning can be performed using org.apache.spark.ml.feature.Bucketizer:

import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.functions.count
import spark.implicits._

val df = Seq(
  (1, 118.0), (2, 109.0), (3, 113.0), (4, 82.0), (5, 60.0),
  (6, 111.0), (7, 107.0), (8, 84.0), (9, 91.0), (10, 118.0)
).toDF("id", "value")

// split points 0.0, 10.0, ..., 120.0 define 12 buckets
val splits = (0 to 12).map(_ * 10.0).toArray

val bucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bucket")
  .setSplits(splits)

val bucketed = bucketizer.transform(df)

val solution = bucketed.groupBy($"bucket").agg(count($"id") as "count")

Result:

scala> solution.show
+------+-----+
|bucket|count|
+------+-----+
|   8.0|    2|
|  11.0|    4|
|  10.0|    2|
|   6.0|    1|
|   9.0|    1|
+------+-----+

Bucketizer throws an error when values lie outside the defined bins. Split points can be defined as Double.NegativeInfinity or Double.PositiveInfinity to capture such outliers.
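
A minimal sketch of such splits (the variable names are my own), reusing the splits array from above; the two extra buckets catch values below 0 and above 120:

// two extra catch-all buckets at both ends
val splitsWithOutliers =
  (Double.NegativeInfinity +: splits) :+ Double.PositiveInfinity

val safeBucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bucket")
  .setSplits(splitsWithOutliers)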

Bucketizer is designed to work efficiently with arbitrary splits by performing a binary search for the right bucket. For regular bins like yours, one can simply do something like:

val binned = df.withColumn("bucket", (($"value" - bin_min) / bin_width) cast "int")

where bin_min and bin_width are the left edge of the lowest bin and the bin width, respectively.
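
With the question's ranges this reduces to bin_min = 0 and bin_width = 10; a quick sketch reproducing the counts from above (with integer bucket indices instead of Doubles):

val bin_min = 0.0
val bin_width = 10.0

val binned = df.withColumn("bucket", (($"value" - bin_min) / bin_width) cast "int")
binned.groupBy($"bucket").agg(count($"id") as "count").show()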

Upvotes: 15
