Raphael Sampaio

Reputation: 143

Histogram with Spark Dataframe in Java

Is it possible to generate a histogram dataframe with Spark 2.1 in Java from a Dataset<Row> table?

Upvotes: 0

Views: 2090

Answers (1)

Upendra Jeliya

Reputation: 11

  1. Convert the Dataset<Row> into a JavaRDD<T>, where T is the column's type (Integer, Double, etc.), using toJavaRDD().map().
  2. Convert that JavaRDD into a JavaDoubleRDD using mapToDouble().
  3. Call histogram(int bucketCount) on the JavaDoubleRDD. It returns a Tuple2 holding the bucket boundaries (double[]) and the number of values in each bucket (long[]).
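To make the shape of that result concrete, here is a minimal plain-Java sketch (no Spark needed) of the bucketing that histogram(int bucketCount) is documented to perform: buckets evenly spaced between the minimum and maximum, each half-open except the last, which also includes the maximum. The class and method names (EvenBuckets, boundaries, counts) are hypothetical, and this is only an illustration of the output, not Spark's implementation. It assumes non-empty data with min < max.

```java
import java.util.Arrays;

// Illustration only: mimics the documented behaviour of
// JavaDoubleRDD.histogram(int) on a local array.
// All names here are hypothetical and not part of Spark.
public class EvenBuckets {

    // Returns bucketCount + 1 evenly spaced boundaries from min to max.
    // Assumes data is non-empty and bucketCount >= 1.
    public static double[] boundaries(double[] data, int bucketCount) {
        double min = Arrays.stream(data).min().getAsDouble();
        double max = Arrays.stream(data).max().getAsDouble();
        double width = (max - min) / bucketCount;
        double[] edges = new double[bucketCount + 1];
        for (int i = 0; i < bucketCount; i++) {
            edges[i] = min + i * width;
        }
        edges[bucketCount] = max; // pin the last edge to max to avoid rounding drift
        return edges;
    }

    // Counts values per bucket. Every bucket is [lo, hi) except the last,
    // which is [lo, hi] so that the maximum value is counted.
    public static long[] counts(double[] data, double[] edges) {
        int n = edges.length - 1;
        long[] result = new long[n];
        for (double v : data) {
            int idx = 0;
            // Advance while v is at or beyond the next boundary,
            // but never past the last bucket.
            while (idx < n - 1 && v >= edges[idx + 1]) {
                idx++;
            }
            result[idx]++;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] keys = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        double[] edges = boundaries(keys, 5);
        System.out.println(Arrays.toString(edges));
        System.out.println(Arrays.toString(counts(keys, edges)));
    }
}
```

Note that the boundaries array always has one more entry than the counts array: bucket i spans edges[i] to edges[i + 1].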

Example: I have a Spark table named 'nation' with an Integer column 'n_nationkey'. This is how I did it:

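import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.Tuple2;
String query = "select n_nationkey from nation";
Dataset<Row> df = spark.sql(query);
JavaRDD<Integer> jdf = df.toJavaRDD().map(row -> row.getInt(0));
JavaDoubleRDD example = jdf.mapToDouble(y -> y);
Tuple2<double[], long[]> resultsnew = example.histogram(5);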

If the column is of type double, replace getInt with getDouble and change the RDD's type parameter accordingly:

JavaRDD<Double> jdf = df.toJavaRDD().map(row -> row.getDouble(0));
JavaDoubleRDD example = jdf.mapToDouble(y -> y);
Tuple2<double[], long[]> resultsnew = example.histogram(5);
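Since the question asks for the histogram as a table, the two arrays can be zipped into one record per bucket. The following is a plain-Java sketch of that step only; HistogramRows, Bucket and toBuckets are hypothetical names, and the sample arrays stand in for resultsnew._1() and resultsnew._2(). Turning the resulting list back into a Dataset (for example with SparkSession.createDataFrame and a bean class) is an assumption about the next step and is neither shown nor tested here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: zips histogram output into one record per bucket.
public class HistogramRows {

    // Bean-style holder for a single bucket; names are illustrative only.
    public static class Bucket {
        private final double start;
        private final double end;
        private final long count;

        public Bucket(double start, double end, long count) {
            this.start = start;
            this.end = end;
            this.count = count;
        }

        public double getStart() { return start; }
        public double getEnd() { return end; }
        public long getCount() { return count; }
    }

    // edges must have exactly counts.length + 1 entries,
    // which is the shape histogram(int) returns.
    public static List<Bucket> toBuckets(double[] edges, long[] counts) {
        if (edges.length != counts.length + 1) {
            throw new IllegalArgumentException(
                "expected " + (counts.length + 1) + " boundaries, got " + edges.length);
        }
        List<Bucket> rows = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) {
            rows.add(new Bucket(edges[i], edges[i + 1], counts[i]));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-ins for resultsnew._1() and resultsnew._2() from the code above.
        double[] edges = {0.0, 2.0, 4.0, 6.0, 8.0, 10.0};
        long[] counts = {2, 2, 2, 2, 3};
        for (Bucket b : toBuckets(edges, counts)) {
            System.out.println(b.getStart() + " to " + b.getEnd() + ": " + b.getCount());
        }
    }
}
```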

Upvotes: 1
