Reputation: 86735
I have an optimisation problem.
I have a table containing about 15MB of JSON stored as rows of VARCHAR(65535). Each JSON string is an array of arbitrary size.
The task is simple: pivot each array such that each element has its own row.
id | json
----+---------------------------------------------
01 | [{"something":"here"}, {"fu":"bar"}]
=>
id | element_id | json
----+------------+---------------------------------
01 | 1          | {"something":"here"}
01 | 2          | {"fu":"bar"}
Without having any kind of table-valued functions (user defined or otherwise), I've resorted to pivoting via joining against a numbers table.
SELECT
    src.id,
    pvt.element_id,
    json_extract_array_element_text(src.json, pvt.element_id) AS json
FROM
    source_table AS src
INNER JOIN
    numbers_table AS pvt (element_id)
        ON pvt.element_id < json_array_length(src.json)
The numbers table has 512 rows in it (0..511), and the results are correct.
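For reference, one way such a numbers table can be populated (a sketch; big_table is a stand-in name for any existing table with at least 512 rows):
CREATE TABLE numbers_table AS
SELECT element_id
FROM (
    SELECT (ROW_NUMBER() OVER ()) - 1 AS element_id   -- 0-based, matching json_extract_array_element_text
    FROM big_table                                    -- hypothetical: any table with >= 512 rows
) AS n
WHERE element_id < 512;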
The elapsed time is horrendous. And it's not to do with distribution or sort order or encoding. It's to do with (I believe) Redshift's materialisation.
The working memory needed to process 15MB of JSON text is 7.5GB: 15MB * 512 rows in numbers = 7.5GB. If I put just 128 rows in numbers then the working memory needed reduces by 4x and the elapsed time similarly reduces (not 4x; the real query does other work, it's still writing the same amount of results data, etc, etc).
So, I wonder, what about adding this?
WHERE
    pvt.element_id < (SELECT MAX(json_array_length(s.json)) FROM source_table AS s)
No change to the working memory needed, and the elapsed time goes up slightly (effectively a WHERE clause that has a cost but no benefit).
I've tried making a CTE to create the list of 512 numbers; that didn't help. I've tried making a CTE to create the list of numbers with a WHERE clause to limit the size; that didn't help (effectively Redshift appears to have materialised using the 512 rows and THEN applied the WHERE clause).
My current effort is to create a temporary table for the numbers, limited by the WHERE clause. In my sample set this means that I get a table with 67 rows to join on, instead of 512 rows.
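In sketch form (assuming the numbers table's column is named element_id):
CREATE TEMPORARY TABLE numbers_trimmed AS
SELECT element_id
FROM numbers_table
WHERE element_id < (SELECT MAX(json_array_length(json)) FROM source_table);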
That's still not great, as that ONE row with 67 elements dominates the elapsed time (every row, no matter how many elements, gets duplicated 67 times before the ON pvt.element_id < json_array_length(src.json) gets applied).
My next effort will be to work on it in two steps (sketched below):
1. As above, but with a table of only 16 rows, and only for rows with 16 or fewer elements
2. As above, with the dynamically sized numbers table, and only for rows with more than 16 elements
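A sketch of what that two-step query might look like (numbers_16, holding 0..15, and the dynamically sized numbers_trimmed are assumed names):
SELECT
    src.id,
    pvt.element_id,
    json_extract_array_element_text(src.json, pvt.element_id) AS json
FROM
    source_table AS src
INNER JOIN
    numbers_16 AS pvt (element_id)
        ON pvt.element_id < json_array_length(src.json)
WHERE
    json_array_length(src.json) <= 16
UNION ALL
SELECT
    src.id,
    pvt.element_id,
    json_extract_array_element_text(src.json, pvt.element_id) AS json
FROM
    source_table AS src
INNER JOIN
    numbers_trimmed AS pvt (element_id)
        ON pvt.element_id < json_array_length(src.json)
WHERE
    json_array_length(src.json) > 16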
Question: Does anyone have any better ideas?
Upvotes: 1
Views: 3653
Reputation: 4208
Maybe if you avoid parsing and interpreting the JSON as JSON, and instead work with it as text, it can work faster. If you're sure about the structure of your JSON values (which I guess you are, since the original query does not produce a JSON parsing error) you might try just using the split_part function instead of json_extract_array_element_text.
If your elements don't contain commas you can use:
split_part(src.json, ',', pvt.element_id)
If your elements contain commas (but never the '},{' sequence) you might use:
split_part(src.json, '},{', pvt.element_id)
Note that split_part is 1-based, whereas json_extract_array_element_text is 0-based, so the numbers table values would need to be offset by one.
Also, the part with ON pvt.element_id < json_array_length(src.json) in the join condition still parses the JSON, so to avoid JSON parsing completely you might try a cross join and then keep only the non-empty values.
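A minimal sketch of that cross-join idea (assuming '},{' never appears inside an element; each fragment loses its delimiters, so the leading '[{' / trailing '}]' and the stripped braces would still need tidying up):
SELECT
    src.id,
    pvt.element_id,
    split_part(src.json, '},{', pvt.element_id + 1) AS fragment   -- split_part is 1-based
FROM
    source_table AS src
CROSS JOIN
    numbers_table AS pvt (element_id)
WHERE
    split_part(src.json, '},{', pvt.element_id + 1) <> ''         -- empty string means no such element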
Upvotes: 0
Reputation: 14035
Please consider declaring the JSON as an external table. You can then use Redshift Spectrum's nested data syntax to access these values as if they were rows.
There is a quick tutorial here: "Tutorial: Querying Nested Data with Amazon Redshift Spectrum"
Simple example:
{ "id": 1
,"name": { "given":"John", "family":"Smith" }
,"orders": [ {"price": 100.50, "quantity": 9 }
,{"price": 99.12, "quantity": 2 }
]
}
CREATE EXTERNAL TABLE spectrum.nested_tutorial
(id int
,name struct<given:varchar(20), family:varchar(20)>
,orders array<struct<price:double precision, quantity:double precision>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-files/temp/nested_data/nested_tutorial/'
;
SELECT c.id
,c.name.given
,c.name.family
,o.price
,o.quantity
FROM spectrum.nested_tutorial c
LEFT JOIN c.orders o ON true
;
id | given | family | price | quantity
----+-------+--------+-------+----------
1 | John | Smith | 100.5 | 9
1 | John | Smith | 99.12 | 2
Upvotes: 4
Reputation: 269410
Neither the data format nor the task you wish to perform is ideal for Amazon Redshift.
Amazon Redshift is excellent as a data warehouse, with the ability to run queries against billions of rows. However, storing data as JSON is sub-optimal because Redshift cannot use all of its abilities (e.g. Distribution Keys, Sort Keys, Zone Maps, parallel processing) while processing fields stored in JSON.
The efficiency of your Redshift cluster would be much higher if the data were stored as:
id | element_id | key       | value
----+------------+-----------+-------
01 | 1          | something | here
01 | 2          | fu        | bar
As to how best to convert the existing JSON data into separate rows, I would frankly recommend doing it outside of Redshift, then loading the result into tables via the COPY command. A small Python script would be more efficient at converting the data than trying strange JOINs on a numbers table in Redshift.
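A minimal sketch of such a script (the file names, tab delimiter, and id/json column layout are assumptions):
import json

# Read exported rows of "id<TAB>json_array", one per line (assumed layout),
# and write one output row per array element, ready for COPY.
with open('source_rows.tsv') as src, open('flat_rows.tsv', 'w') as out:
    for line in src:
        row_id, json_text = line.rstrip('\n').split('\t', 1)
        for element_id, element in enumerate(json.loads(json_text), start=1):
            out.write('\t'.join([row_id, str(element_id), json.dumps(element)]) + '\n')
The flattened file can then be loaded with COPY (e.g. with DELIMITER '\t').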
Upvotes: 0