Reputation: 86735
I have an optimisation problem.
I have a table containing about 15MB of JSON stored as rows of VARCHAR(65535). Each JSON string is an array of arbitrary size.
The task is simple: pivot each array such that each element has its own row.
id | json
----+---------------------------------------------
01 | [{"something":"here"}, {"fu":"bar"}]
=>
id | element_id | json
----+------------+---------------------------------
01 | 1          | {"something":"here"}
01 | 2          | {"fu":"bar"}
Without having any kind of table-valued functions (user defined or otherwise), I've resorted to pivoting via joining against a numbers table.
SELECT
    src.id,
    pvt.element_id,
    json_extract_array_element_text(src.json, pvt.element_id) AS json
FROM
    source_table AS src
INNER JOIN
    numbers_table AS pvt (element_id)
        ON pvt.element_id < json_array_length(src.json)
The numbers table has 512 rows in it (0..511), and the results are correct.
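For reference, one way such a numbers table can be populated (a sketch; big_table is a stand-in name for any existing table with at least 512 rows):
CREATE TABLE numbers_table AS
SELECT element_id
FROM (
    SELECT (ROW_NUMBER() OVER ()) - 1 AS element_id   -- 0-based, matching json_extract_array_element_text
    FROM big_table                                    -- hypothetical: any table with >= 512 rows
) AS n
WHERE element_id < 512;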
The elapsed time is horrendous. And it's not to do with distribution or sort order or encoding. It's to do with (I believe) Redshift's materialisation.
The working memory needed to process 15MB of JSON text is 7.5GB: 15MB * 512 rows in numbers = 7.5GB. If I put just 128 rows in numbers then the working memory needed reduces by 4x and the elapsed time similarly reduces (not 4x; the real query does other work, it's still writing the same amount of results data, etc, etc).
So, I wonder, what about adding this?
WHERE
    pvt.element_id < (SELECT MAX(json_array_length(s.json)) FROM source_table AS s)
No change to the working memory needed, and the elapsed time goes up slightly (effectively a WHERE clause that has a cost but no benefit).
I've tried making a CTE to create the list of 512 numbers; that didn't help. I've tried making a CTE to create the list of numbers with a WHERE clause to limit the size; that didn't help (effectively Redshift appears to have materialised using the 512 rows and THEN applied the WHERE clause).
My current effort is to create a temporary table for the numbers, limited by the WHERE clause. In my sample set this means that I get a table with 67 rows to join on, instead of 512 rows.
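In sketch form (assuming the numbers table's column is named element_id):
CREATE TEMPORARY TABLE numbers_trimmed AS
SELECT element_id
FROM numbers_table
WHERE element_id < (SELECT MAX(json_array_length(json)) FROM source_table);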
That's still not great, as that ONE row with 67 elements dominates the elapsed time (every row, no matter how many elements, gets duplicated 67 times before the ON pvt.element_id < json_array_length(src.json) gets applied).
My next effort will be to work on it in two steps (sketched below):
1. As above, but with a table of only 16 rows, and only for rows with 16 or fewer elements
2. As above, with the dynamically sized numbers table, and only for rows with more than 16 elements
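A sketch of what that two-step query might look like (numbers_16, holding 0..15, and the dynamically sized numbers_trimmed are assumed names):
SELECT
    src.id,
    pvt.element_id,
    json_extract_array_element_text(src.json, pvt.element_id) AS json
FROM
    source_table AS src
INNER JOIN
    numbers_16 AS pvt (element_id)
        ON pvt.element_id < json_array_length(src.json)
WHERE
    json_array_length(src.json) <= 16
UNION ALL
SELECT
    src.id,
    pvt.element_id,
    json_extract_array_element_text(src.json, pvt.element_id) AS json
FROM
    source_table AS src
INNER JOIN
    numbers_trimmed AS pvt (element_id)
        ON pvt.element_id < json_array_length(src.json)
WHERE
    json_array_length(src.json) > 16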
Question: Does anyone have any better ideas?
Upvotes: 1
Views: 3653
Reputation: 4208
Maybe if you avoid parsing and interpreting the JSON as JSON, and instead work with it as text, it can work faster. If you're sure about the structure of your JSON values (which I guess you are, since the original query does not produce a JSON parsing error) you might try just using the split_part function instead of json_extract_array_element_text.
If your elements don't contain commas you can use:
split_part(src.json, ',', pvt.element_id)
If your elements contain commas (but never the '},{' sequence) you might use:
split_part(src.json, '},{', pvt.element_id)
Note that split_part is 1-based, whereas json_extract_array_element_text is 0-based, so the numbers table values would need to be offset by one.
Also, the part with ON pvt.element_id < json_array_length(src.json) in the join condition still parses the JSON, so to avoid JSON parsing completely you might try a cross join and then keep only the non-empty values.
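A minimal sketch of that cross-join idea (assuming '},{' never appears inside an element; each fragment loses its delimiters, so the leading '[{' / trailing '}]' and the stripped braces would still need tidying up):
SELECT
    src.id,
    pvt.element_id,
    split_part(src.json, '},{', pvt.element_id + 1) AS fragment   -- split_part is 1-based
FROM
    source_table AS src
CROSS JOIN
    numbers_table AS pvt (element_id)
WHERE
    split_part(src.json, '},{', pvt.element_id + 1) <> ''         -- empty string means no such element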
Upvotes: 0
Reputation: 14035
Please consider declaring the JSON as an external table. You can then use Redshift Spectrum's nested data syntax to access these values as if they were rows.
There is a quick tutorial here: "Tutorial: Querying Nested Data with Amazon Redshift Spectrum"
Simple example:
{ "id": 1
,"name": { "given":"John", "family":"Smith" }
,"orders": [ {"price": 100.50, "quantity": 9 }
,{"price": 99.12, "quantity": 2 }
]
}
CREATE EXTERNAL TABLE spectrum.nested_tutorial
(id int
,name struct<given:varchar(20), family:varchar(20)>
,orders array<struct<price:double precision, quantity:double precision>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-files/temp/nested_data/nested_tutorial/'
;
SELECT c.id
,c.name.given
,c.name.family
,o.price
,o.quantity
FROM spectrum.nested_tutorial c
LEFT JOIN c.orders o ON true
;
id | given | family | price | quantity
----+-------+--------+-------+----------
1 | John | Smith | 100.5 | 9
1 | John | Smith | 99.12 | 2
Upvotes: 4
Reputation: 269410
Neither the data format nor the task you wish to perform is ideal for Amazon Redshift.
Amazon Redshift is excellent as a data warehouse, with the ability to run queries against billions of rows. However, storing data as JSON is sub-optimal because Redshift cannot use all of its abilities (e.g. Distribution Keys, Sort Keys, Zone Maps, parallel processing) while processing fields stored in JSON.
The efficiency of your Redshift cluster would be much higher if the data were stored as:
id | element_id | key       | value
----+------------+-----------+-------
01 | 1          | something | here
01 | 2          | fu        | bar
As to how best to convert the existing JSON data into separate rows, I would frankly recommend doing it outside of Redshift, then loading the result into tables via the COPY command. A small Python script would be more efficient at converting the data than trying strange JOINs on a numbers table in Redshift.
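A minimal sketch of such a script (the file names, tab delimiter, and id/json column layout are assumptions):
import json

# Read exported rows of "id<TAB>json_array", one per line (assumed layout),
# and write one output row per array element, ready for COPY.
with open('source_rows.tsv') as src, open('flat_rows.tsv', 'w') as out:
    for line in src:
        row_id, json_text = line.rstrip('\n').split('\t', 1)
        for element_id, element in enumerate(json.loads(json_text), start=1):
            out.write('\t'.join([row_id, str(element_id), json.dumps(element)]) + '\n')
The flattened file can then be loaded with COPY (e.g. with DELIMITER '\t').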
Upvotes: 0