Unknown duplicates from querying a nested JSON

Question

I would like to do text search in a JSON object in a table.

I have a table called Audio that is structured like below:

id| keyword | transcript | user_id | company_id | client_id 
-----------------------------------------------------------

This is the JSON data structure of transcript:

{"transcript": [
    {"duration": 2390.0, 
    "interval": [140.0, 2530.0], 
    "speaker": "Speaker_2", 
    "words": [
        {"p": 0, "s": 0, "e": 320, "c": 0.545, "w": "This"}, 
        {"p": 1, "s": 320, "e": 620, "c": 0.825, "w": "call"}, 
        {"p": 2, "s": 620, "e": 780, "c": 0.909, "w": "is"}, 
        {"p": 3, "s": 780, "e": 1010, "c": 0.853, "w": "being"}, 
        {"p": 4, "s": 1010, "e": 1250, "c": 0.814, "w": "recorded"}
        ]
    }, 
    {"duration": 4360.0, 
    "interval": [3280.0, 7640.0], 
    "speaker": "Speaker_1", 
    "words": [
        {"p": 5, "s": 5000, "e": 5020, "c": 0.079, "w": "as"},
        {"p": 6, "s": 5020, "e": 5100, "c": 0.238, "w": "a"},
        {"p": 7, "s": 5100, "e": 5409, "c": 0.689, "w": "group"},
        {"p": 8, "s": 5410, "e": 5590, "c": 0.802, "w": "called"},
        {"p": 9, "s": 5590, "e": 5870, "c": 0.834, "w": "tricks"}
        ]
    },
    ...
}

What I am trying to do is to do a text search in the "w" field within "words". This is the query that I tried to run:

WITH info_data AS (
    SELECT transcript_info->'words' AS info
    FROM Audio t, json_array_elements(transcript->'transcript') AS transcript_info)
SELECT info_item->>'w', id
FROM Audio, info_data idata, json_array_elements(idata.info) AS info_item
WHERE info_item->>'w' ilike '%this';

Right now I only have four columns with data and the fifth column is null. And there are five columns in total. However, I got the following result where even the column that doesn't have data results an output:

?column? | id 
----------+----
 This     |  2
 This     |  5
 This     |  1
 This     |  3
 This     |  4
 This     |  2
 This     |  5

I would love to know what the problem of my query is and whether there are more efficient way in doing this.

Patrick · Accepted Answer

The problem is that you make a cartesian join between table Audio on the one hand and info_data and info_item on the other hand (there is an implicit lateral join between these latter two) here:

FROM Audio, info_data idata, json_array_elements(idata.info) AS info_item

You can solve this by adding Audio.id to the CTE and then adding WHERE Audio.id = info_data.id.

It is doubtful that this is the most efficient solution (CTEs rarely are). If you just want to get those rows where the word "this" is a word in the transcript, then you are most likely better off like this:

SELECT DISTINCT id
FROM (
    SELECT id, transcript_info->'words' AS info
    FROM Audio, json_array_elements(transcript->'transcript') AS transcript_info) AS t,
    json_array_elements(info) AS words
WHERE words->>'w' ILIKE 'this';

Note that the % in the pattern string is very inefficient. Since very few words in the English language other than "this" end with the same, I have taken the liberty of removing it.

Unknown duplicates from querying a nested JSON

Answers (1)

Related Questions