JBoy
JBoy

Reputation: 5735

Spark SQL java add column with count of distinct rows

Very new to SQL and Spark and i am trying to add a column on a dataset containing the count distinct. Dataset:

| col1 | col2 | 
|   A  |  B   |
|   C  |  D   |
|   A  |  B   |
|   A  |  B   |

Desired result:

| col1 | col2 | uniques |
|   A  |  B   |   3     |
|   C  |  D   |   1     |

My Java code:

return dataset.agg(countDistinct(col1,col2));

But there is no effect

Upvotes: 0

Views: 807

Answers (1)

gudok
gudok

Reputation: 4179

Distinct is not applicable here. Count distinct removes duplicates in every group before counting, hence "uniques" column always will contain only ones.

To get the results you want, you need to perform basic grouping/aggregating operation. Below is the number of ways to achieve it:

SparkSession spark = ...;

StructType schema = new StructType(new StructField[]{
    new StructField("col1", DataTypes.StringType, true, new MetadataBuilder().build()),
    new StructField("col2", DataTypes.StringType, true, new MetadataBuilder().build())
});

List<Row> rows = new ArrayList<>();
rows.add(RowFactory.create("A", "B"));
rows.add(RowFactory.create("C", "D"));
rows.add(RowFactory.create("A", "B"));
rows.add(RowFactory.create("A", "B"));

Dataset<Row> ds = spark.createDataFrame(rows, schema);
ds.createTempView("table");

// (1)
spark.sql("select col1, col2, count(*) as uniques from table group by col1, col2").show();

// (2)
ds.groupBy(ds.col("col1"), ds.col("col2")).count().show();

// (3)
ds.groupBy(ds.col("col1"), ds.col("col2"))
  .agg(functions.count(functions.lit(1)).alias("uniques") /*, functions.avg(...), functions.sum(...) */)
  .show();

First example is what is known as "Spark SQL".

The syntax of (2) and (3) may be hard to understand. I will try to explain them in very basic terms. groupBy groups data (logically) into something like Map<GroupKey, List<Row>>. count applies counting aggregate function for every group (the result of this function is new column) and "throws away" List<Row>. Hence in the result we have a table consisting of "col1", "col2" (they are automatically added because they are the grouping keys) and new column "uniques".

Sometimes you will need to apply multiple aggregate functions simultaneously. Third example addresses this issue. You can list multiple functions inside agg. Every such function will result in a new column.

Upvotes: 1

Related Questions