Split an array column into chunks of max size

Question

I have a DataFrame with a column of array[string] type:

scala> df.printSchema
root
 |-- user: string (nullable = true) ### this is a unique key
 |-- items: array (nullable = true)
 |    |-- element: string (containsNull = true)

Due to some limitations on the consumer's side, I need to limit the number of elements in the items column, e.g. to a maximum of 1000 elements. The resulting DataFrame would have the same schema, except that user would no longer be unique, since one user's items may be split across several rows. For example, with max elements = 3:

Input DataFrame:

+----+----------------------+
|user|items                 |
+----+----------------------+
|u1  |[a, b, cc, d, e, f, g]|
|u2  |[h, ii]               |
|u3  |[j, kkkk, m, nn, o]   |
+----+----------------------+

Output DataFrame:

+----+------------+
|user|items       |
+----+------------+
|u1  |[a, f, g]   |
|u1  |[b, cc, d]  |
|u1  |[e]         |
|u2  |[h, ii]     |
|u3  |[j, nn, m]  |
|u3  |[kkkk, o]   |
+----+------------+

The order of items is not important. Each item is an alphanumeric string, and item lengths vary.

Performance is not an issue because the DataFrame is small, but the solution needs to be in Spark SQL.
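
For reference, here is a minimal sketch of one way the chunking described above could be expressed with Spark SQL expressions. It assumes Spark 2.4 or later (for the `sequence`, `transform`, and `slice` built-ins); the object name `ChunkItems`, the helper `chunk`, and the `maxSize` parameter are made up for illustration. It keeps the original element order within each chunk, which the question does not require, and it drops rows whose items array is empty or null.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ChunkItems {

  // Hypothetical helper: splits the `items` array of each row into
  // chunks of at most `maxSize` elements, one output row per chunk.
  def chunk(df: DataFrame, maxSize: Int): DataFrame = {
    require(maxSize > 0, "maxSize must be positive")
    df
      // sequence(1, 0, n) is not a valid range, so skip empty (and null) arrays.
      .where("size(items) > 0")
      .selectExpr(
        "user",
        // sequence(1, size, maxSize) yields the 1-based start position of each chunk,
        // e.g. 1, 4, 7 for 7 elements and maxSize = 3.
        // slice(items, i, maxSize) takes up to maxSize elements starting at position i.
        s"explode(transform(sequence(1, size(items), $maxSize), " +
          s"i -> slice(items, i, $maxSize))) AS items"
      )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("chunk-items")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question.
    val df = Seq(
      ("u1", Seq("a", "b", "cc", "d", "e", "f", "g")),
      ("u2", Seq("h", "ii")),
      ("u3", Seq("j", "kkkk", "m", "nn", "o"))
    ).toDF("user", "items")

    chunk(df, 3).show(false)
    spark.stop()
  }
}
```

The same expression should also work inside a plain `spark.sql(...)` query after registering the DataFrame as a temporary view.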
