JWorrell

Reputation: 93

Databricks SQL string_agg

Hopefully a quick one;

Migrating some on-premises SQL views to Databricks and struggling to find conversions for some functions. The main one is the STRING_AGG function:

string_agg(field_name, ', ')

Anyone know how to convert that to Databricks SQL?

Thanks in advance.

Upvotes: 8

Views: 17186

Answers (5)

Lukas Eder

Reputation: 221155

If you're looking for a translation of standard SQL LISTAGG with ORDER BY support, such as this:

SELECT listagg(b, ', ') WITHIN GROUP (ORDER BY a)
FROM VALUES (3, 'a'), (1, 'c'), (2, 'b') AS t (a, b)

Then, you'll have to combine:

  • ARRAY_AGG for aggregating data into an array (of tuples)
  • ARRAY_SORT to sort this array
  • TRANSFORM to extract the desired data from the array of tuples into an array of scalar values
  • ARRAY_JOIN for aggregating array data into a string

Explicitly:

SELECT array_join(
  transform(
    array_sort(
      array_agg((a, b)),
      -- Optionally, add null handling as well
      (e1, e2) -> CASE WHEN e1.a < e2.a THEN -1 WHEN e1.a > e2.a THEN 1 ELSE 0 END
    ), 
    e -> e.b
  ), ', '
) s
FROM VALUES (3, 'a'), (1, 'c'), (2, 'b') AS t (a, b);

Producing:

c, b, a
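The four-step pipeline above can be sketched in plain Python to make each stage concrete. This is illustrative only, not Spark code; the variable names are my own:

```python
# Plain-Python sketch of the SQL pipeline above:
#   array_agg((a, b))  -> collect tuples
#   array_sort(...)    -> sort the tuples by the ordering key a
#   transform(e -> e.b) -> extract the aggregated value
#   array_join(..., ', ') -> join into one string
rows = [(3, 'a'), (1, 'c'), (2, 'b')]

pairs = list(rows)                  # array_agg((a, b))
pairs.sort(key=lambda e: e[0])      # array_sort, comparing on a
values = [b for _, b in pairs]      # transform e -> e.b
s = ', '.join(values)               # array_join(..., ', ')
print(s)  # c, b, a
```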

Add ARRAY_DISTINCT to the call chain if you want LISTAGG(DISTINCT ..) behaviour. If you want to work with a LISTAGG window function, just turn the ARRAY_AGG aggregate function into a window function, e.g.:

SELECT array_join(
  transform(
    array_sort(
      array_agg((a, b)) OVER (ORDER BY a), -- window function here
      (e1, e2) -> CASE WHEN e1.a < e2.a THEN -1 WHEN e1.a > e2.a THEN 1 ELSE 0 END
    ), 
    e -> e.b
  ), ', '
) s
FROM VALUES (3, 'a'), (1, 'c'), (2, 'b') AS t (a, b)
ORDER BY a DESC;

Producing:

|s      |
|-------|
|c, b, a|
|c, b   |
|c      |

Upvotes: 1

Ryan Wade

Reputation: 31

Thanks for the answer @wBob. I am able to guarantee the sort order by modifying your code in the following way:

array_join(array_sort(collect_set(col2)),",") j

The array_sort() sorts the items returned by collect_set() and array_join() converts that output into a single string.
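The same dedupe-sort-join chain can be sketched in plain Python (illustrative only; the sample values are my own):

```python
# Plain-Python analogue of array_join(array_sort(collect_set(col2)), ","):
#   set(...)    -> collect_set (deduplicate)
#   sorted(...) -> array_sort
#   ','.join    -> array_join with "," as the separator
col2 = ['b', 'a', 'b', 'c']
j = ','.join(sorted(set(col2)))
print(j)  # a,b,c
```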

Upvotes: 3

Databricks SQL support is for basic SQL queries only, so procedure-oriented queries are not supported in the current Databricks SQL version. This would fall under a new feature request. You can handle basic SQL functions only link

Note: Databricks SQL provides a simple experience for SQL users who want to run quick ad-hoc queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards. It is not intended to replace the ETL workloads running in Python/PySpark which we are currently handling.

Upvotes: -2

wBob

Reputation: 14389

The rough equivalent would be using collect_set and array_join, but note that the order is lost:

%sql
SELECT col1, array_join(collect_set(col2), ',') j
FROM tmp
GROUP BY col1

I do not think STRING_AGG guarantees order (unless you specify the WITHIN GROUP...ORDER BY clause), but you should expect the order not to match. Hopefully the order does not matter, but double-check that it has no implications for your process. As per the official documentation:

[collect_list] is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.
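A plain-Python sketch of why this matters: the same multiset of values arriving in a different order (as can happen after a shuffle) yields a different joined string unless you sort first. The sample values are my own:

```python
# Two "runs" collecting the same values in different arrival orders.
run1 = ['a', 'b', 'c']
run2 = ['c', 'a', 'b']

print(','.join(run1))           # a,b,c
print(','.join(run2))           # c,a,b  -- same values, different string
print(','.join(sorted(run2)))   # a,b,c  -- sorting makes it deterministic
```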

They have recently added the ordinal argument to STRING_AGG to Azure SQL DB, Managed Instance and Synapse, but presumably you don't yet have that feature on-premises anyway.

Upvotes: 21

Hubert Dudek

Reputation: 1730

You can use the concat_ws function, as described here: https://spark.apache.org/docs/latest/api/sql/index.html#concat_ws

SELECT concat_ws(' ', 'Spark', 'SQL');

Upvotes: -3
