pixel

Reputation: 77

Unnesting a json in Redshift causing nested loop in the query plan

I have a column in my tables called 'data' with JSONs in it like below:

{"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}

I have written code to unnest it into separate columns like tr, r, s. Below is the code:

with raw as (
    -- pull the JSON array of records out of each row, from both tables
    SELECT json_extract_path_text(B.Data, 'records', true) as items
    FROM tableB as B
    WHERE B.date::timestamp between
        to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS') AND
        to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MI:SS')
    UNION ALL
    SELECT json_extract_path_text(C.Data, 'records', true) as items
    FROM tableC as C
    WHERE C.date-5 between
        to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS') AND
        to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MI:SS')
),

numbers as (
    -- generate ordinals 0..999 used to index into the JSON array
    SELECT ROW_NUMBER() OVER (ORDER BY TRUE)::integer - 1 as ordinal
    FROM <any_random_table> limit 1000
),

joined as (
    -- one output row per element of the JSON array
    select raw.*,
        json_array_length(raw.items, true) as number_of_items,
        json_extract_array_element_text(
            raw.items,
            numbers.ordinal::int,
            true
        ) as item
    from raw
    cross join numbers
    where numbers.ordinal < json_array_length(raw.items, true)
),

parsed as (
    -- extract the individual fields from each array element
    SELECT J.*,
        json_extract_path_text(J.item, 'tr', true) as tr,
        json_extract_path_text(J.item, 'r', true) as r,
        json_extract_path_text(J.item, 's', true)::float8 as s
    from joined J
)
select * from parsed

The above code works when there is a small number of records, but it takes more than a day to run, with CPU utilization (in Redshift) reaching 100% and disk space usage also reaching 100%, if I set the date range to the last two years or the number of records is large.

Can anyone please suggest an alternative way to unnest JSON objects like the above in Redshift?

My query plan is saying:

Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products

Goal: To unnest without using any cross joins

Input: data column having JSON

"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}

Output should be, for example, the tr, r, s columns from the above JSON.

Upvotes: 0

Views: 683

Answers (1)

Bill Weiner

Reputation: 11057

You want to unnest up to 1000 JSON records stored in a JSON array, but the nested loop join is taking too long.

The root issue is likely your data model. You have stored structured records (called "records") inside a semi-structured text element (JSON), within a column of a structured columnar database. You want to perform some operation on these buried records that you haven't described, but here's the problem: columnar databases are optimized for read-centric analytic queries, yet you need to expand these internal JSON records into Redshift rows (records), which is fundamentally a write operation. This works against the optimizations of the database.

The size of this expanded data is also large compared to the disk storage on your cluster, which is why the disks are filling up. Your CPUs are likely spinning, unpacking the JSON and managing overloaded disk and memory capacity. When disks are close to full, Redshift shifts to a mode that optimizes disk space utilization at the expense of execution speed. A larger cluster may give you significantly faster execution if you can avoid this effect, but that will cost money you may not have budgeted. Not an ideal solution.

One change that would improve the speed of your query is not carrying all the data along. You keep raw.* and J.* all the way through the query, but it is not clear you need them. Since part of the issue is data size during execution, and this execution includes loop joining, you are making the execution much harder than it needs to be by carrying all this data (including the original JSON).
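For illustration, here is a minimal sketch of what trimming those projections might look like, assuming only tr, r, and s are needed downstream and that the raw and numbers CTEs stay exactly as written in the question:

joined as (
    -- carry forward only the extracted array element, not raw.* or the original JSON
    select
        json_extract_array_element_text(raw.items, numbers.ordinal::int, true) as item
    from raw
    cross join numbers
    where numbers.ordinal < json_array_length(raw.items, true)
),
parsed as (
    -- project only the final output columns
    select
        json_extract_path_text(item, 'tr', true) as tr,
        json_extract_path_text(item, 'r', true) as r,
        json_extract_path_text(item, 's', true)::float8 as s
    from joined
)
select * from parsed

The cross join is still there, but each joined row is now one small text item instead of the whole source row plus the full JSON.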

The best way out of this situation is to change your data model and expand these internal JSON records into Redshift records on ingestion. JSON data is fine for seldom-used information, or for information that is only needed at the end of a query where the data is small. Needing the expanded JSON at the input end of the query for such a large amount of data is not a good use case for JSON in Redshift. Each of the "records" inside the JSON is a record and needs to be stored as such if you need to work across them as query input.
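As a rough sketch of that model (table and column names here are hypothetical, and the actual expansion would be done by your ingestion/ETL process rather than at query time):

-- hypothetical target table: one Redshift row per inner "records" element,
-- populated when the data is loaded
CREATE TABLE record_items (
    source_id  varchar(64),   -- key back to the originating row in tableB/tableC
    r          varchar(32),
    t          varchar(32),
    s          float8,
    d          float8,
    c          int,
    load_date  date
)
DISTKEY (source_id)
SORTKEY (load_date);

With the records stored as rows up front, the date-range query becomes a straightforward scan with no cross join and no JSON parsing at query time.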

Now you want to know if there is some slick way to get around this issue in your case, and the answer is "unlikely, but maybe". Can you describe how you are using the final values in your query (t, r, and s)? If you are just using some aspect of this data (a max value, a sum, ...), then there may be a way to get to the answer without the large nested loop join. But if you need all the values, then there is no other way to get them, AFAIK. A description of what comes next in the data process could open up such an opportunity.

Upvotes: 2
