Reputation: 129
This is a two-part question. If you're checking this out, thanks for your time!
Is there a way to make my query faster?
I previously asked a question here, and was eventually able to solve the problem myself.
However, the query I devised to produce my desired results is VERY slow (25+ minutes) when run against my database, which contains 40,000+ records.
The query is serving its purpose, but I'm hoping one of you brilliant people can show me how to make it run at a more reasonable speed.
My query:
with dupe as (
    select
        json_document->'Firstname'->0->'Content' as first_name,
        json_document->'Lastname'->0->'Content' as last_name,
        identifiers->'RecordID' as record_id
    from (
        select *,
               jsonb_array_elements(json_document->'Identifiers') as identifiers
        from staging
    ) sub
    group by record_id, json_document
    order by last_name
)
select * from dupe da where (
    select count(*) from dupe db
    where db.record_id = da.record_id
) > 1;
Again, some sample data:
Row 1:
{
    "Firstname": "Bobb",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "123",
            "RecordID": "123",
            "SystemID": "Test",
            "LastUpdated": "2017-09-12T02:23:30.817Z"
        },
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        },
        {
            "Content": "def",
            "RecordID": "def",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        }
    ]
}
Row 2:
{
    "Firstname": "Bob",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:26.020Z"
        }
    ]
}
If I were to bring my query's results, or a portion of them, into a Python environment where they could be manipulated with Pandas, how could I iterate over the results of my query (or the sub-query) to achieve the same end result as my original query?
Is there an easier way, using Python, to iterate through my unnested JSON array in the same way that Postgres does?
For example, after performing this query:
select
    json_document->'Firstname'->0->'Content' as first_name,
    json_document->'Lastname'->0->'Content' as last_name,
    identifiers->'RecordID' as record_id
from (
    select *,
           jsonb_array_elements(json_document->'Identifiers') as identifiers
    from staging
) sub
order by last_name;
How, using Python/Pandas, can I take that query's results and perform something like:
da = datasets[query_results] # to equal my dupe da query
db = datasets[query_results] # to equal my dupe db query
Then perform the equivalent of
select * from dupe da where (
    select count(*) from dupe db
    where db.record_id = da.record_id
) > 1;
in Python?
I apologize if I do not provide enough information here. I am a Python novice. Any and all help is greatly appreciated! Thanks!!
Upvotes: 1
Views: 133
Reputation: 5962
Try the following, which eliminates your count(*) and instead uses exists.
with dupe as (
    select
        id,
        json_document->'Firstname'->0->'Content' as first_name,
        json_document->'Lastname'->0->'Content' as last_name,
        identifiers->'RecordID' as record_id
    from (
        select *,
               jsonb_array_elements(json_document->'Identifiers') as identifiers
        from staging
    ) sub
    group by id, record_id, json_document
    order by last_name
)
select * from dupe da
where exists (
    select *
    from dupe db
    where db.record_id = da.record_id
      and db.id != da.id
);
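exists can return as soon as it finds one matching row, whereas count(*) has to tally every match, and the db.id != da.id guard keeps a row from matching itself; with 40,000+ records that difference tends to add up.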
Upvotes: 1
Reputation: 107642
Consider reading the raw, unqueried values of the Postgres json column and using pandas' json_normalize() to flatten them into a dataframe. From there, use pandas' drop_duplicates().
To demonstrate, the snippet below parses your sample JSON document into a three-row dataframe, one row per Identifiers record:
import json
import pandas as pd
json_str = '''
{
    "Firstname": "Bobb",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "123",
            "RecordID": "123",
            "SystemID": "Test",
            "LastUpdated": "2017-09-12T02:23:30.817Z"
        },
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        },
        {
            "Content": "def",
            "RecordID": "def",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        }
    ]
}
'''
data = json.loads(json_str)
df = pd.io.json.json_normalize(data, 'Identifiers', ['Firstname','Lastname'])
print(df)
# Content LastUpdated RecordID SystemID Lastname Firstname
# 0 123 2017-09-12T02:23:30.817Z 123 Test Smith Bobb
# 1 abc 2017-09-13T10:10:21.598Z abc Test Smith Bobb
# 2 def 2017-09-13T10:10:21.598Z def Test Smith Bobb
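Note that drop_duplicates keeps only the first row per RecordID. If you instead want to surface every duplicated row, i.e. the equivalent of the count(*) > 1 filter in your original query, a minimal sketch uses pandas' duplicated() with keep=False (run it against the full dataframe built from your database below, since this three-row demo has no repeated RecordID):
# keep=False marks all members of each duplicate group, not just the
# second and later occurrences, mirroring the original count(*) > 1 filter
dupes = df[df.duplicated('RecordID', keep=False)]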
For your database, consider connecting with a DB-API such as psycopg2
or SQLAlchemy and parsing each json document from a string. Admittedly, there may be other ways to handle json, as seen in the psycopg2 docs, but the code below receives the data as text and parses it on the Python side:
import json
import pandas as pd
import psycopg2
conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()
cur.execute("SELECT json_document::text FROM staging;")
df = pd.io.json.json_normalize([json.loads(row[0]) for row in cur.fetchall()],
'Identifiers', ['Firstname','Lastname'])
df = df.drop_duplicates(['RecordID'])
cur.close()
conn.close()
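Alternatively, if you would rather let pandas handle the round trip, here is a minimal sketch with SQLAlchemy and pd.read_sql; the connection string is a placeholder, so substitute your own host, database, and credentials:
import json
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; adjust to your server and credentials.
engine = create_engine("postgresql+psycopg2://postgres@localhost/test")

# Fetch each jsonb document as text, then flatten exactly as above.
raw = pd.read_sql("SELECT json_document::text AS doc FROM staging", engine)
df = pd.io.json.json_normalize([json.loads(doc) for doc in raw['doc']],
                               'Identifiers', ['Firstname', 'Lastname'])
df = df.drop_duplicates(['RecordID'])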
Upvotes: 1