jmoneygram

Reputation: 129

How can I make my Postgres query perform faster? Can I use Python to iterate over the results more quickly?

This is a two-part question. If you're checking this out, thanks for your time!

  1. Is there a way to make my query faster?

    I previously asked a question here, and was eventually able to solve the problem myself.

    However, the query I devised to produce my desired results is VERY slow (25+ minutes) when run against my database, which contains 40,000+ records.

    The query serves its purpose, but I'm hoping one of you brilliant people can point out how to make it perform at an acceptable speed.

    My query:

    with dupe as (
        select
             json_document->'Firstname'->0->'Content' as first_name,
             json_document->'Lastname'->0->'Content' as last_name,
             identifiers->'RecordID' as record_id
        from (
            select *,  
                   jsonb_array_elements(json_document->'Identifiers') as identifiers
            from staging
        ) sub
        group by record_id, json_document
        order by last_name
    ) 
    
    select * from dupe da where (
      select count(*) from dupe db 
      where db.record_id = da.record_id
    ) > 1;
    

    Again, some sample data:

    Row 1:

    {
            "Firstname": "Bobb",
            "Lastname": "Smith",
            "Identifiers": [
                {
                    "Content": "123",
                    "RecordID": "123",
                    "SystemID": "Test",
                    "LastUpdated": "2017-09-12T02:23:30.817Z"
                },
                {
                    "Content": "abc",
                    "RecordID": "abc",
                    "SystemID": "Test",
                    "LastUpdated": "2017-09-13T10:10:21.598Z"
                },
                {
                    "Content": "def",
                    "RecordID": "def",
                    "SystemID": "Test",
                    "LastUpdated": "2017-09-13T10:10:21.598Z"
                }
            ]
    }
    

    Row 2:

    {
            "Firstname": "Bob",
            "Lastname": "Smith",
            "Identifiers": [
                {
                    "Content": "abc",
                    "RecordID": "abc",
                    "SystemID": "Test",
                    "LastUpdated": "2017-09-13T10:10:26.020Z"
                }
            ]
    }
    
  2. If I were to bring my query's results, or a portion of them, into a Python environment where they could be manipulated using Pandas, how could I iterate over the results of my query (or of the sub-query) to achieve the same end result as my original query?

    Is there an easier way, using Python, to iterate through my un-nested json array in the same way that Postgres does?

    For example, after performing this query:

    select
        json_document->'Firstname'->0->'Content' as first_name,
        json_document->'Lastname'->0->'Content' as last_name,
        identifiers->'RecordID' as record_id
    from (
           select *,  
                  jsonb_array_elements(json_document->'Identifiers') as identifiers
           from staging
         ) sub
    order by last_name;
    

    How, using Python/Pandas, can I take that query's results and perform something like:

    da = datasets[query_results]  # to equal my dupe da query
    db = datasets[query_results]  # to equal my dupe db query
    

    Then perform the equivalent of

    select * from dupe da where (
        select count(*) from dupe db 
        where db.record_id = da.record_id
    ) > 1;
    

    in Python?
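
    For illustration, here is a rough sketch of the kind of pandas filter I have in mind (I'm guessing at the right tools here; the dataframe below just hard-codes the sample rows in place of real query results):

    import pandas as pd

    # Hard-coded stand-in for the un-nested query's results; in practice
    # this would be loaded from the database (e.g. via pd.read_sql).
    df = pd.DataFrame({
        'first_name': ['Bobb', 'Bobb', 'Bobb', 'Bob'],
        'last_name': ['Smith', 'Smith', 'Smith', 'Smith'],
        'record_id': ['123', 'abc', 'def', 'abc'],
    })

    # Count rows per record_id and keep those occurring more than once,
    # mirroring the correlated count(*) > 1 subquery.
    counts = df.groupby('record_id')['record_id'].transform('size')
    dupes = df[counts > 1]
    print(dupes)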

I apologize if I do not provide enough information here. I am a Python novice. Any and all help is greatly appreciated! Thanks!!

Upvotes: 1

Views: 133

Answers (2)

Joe Love

Reputation: 5962

Try the following, which eliminates your count(*) and instead uses exists. exists can stop at the first matching row, whereas count(*) has to count every match before the > 1 comparison is evaluated. (Note this version assumes staging has a unique id column, so that a row can be prevented from matching itself.)

with dupe as (
    select
        id,
        json_document->'Firstname'->0->'Content' as first_name,
        json_document->'Lastname'->0->'Content' as last_name,
        identifiers->'RecordID' as record_id
    from (
        select *,
               jsonb_array_elements(json_document->'Identifiers') as identifiers
        from staging
    ) sub
    group by id, record_id, json_document
    order by last_name
)
select * from dupe da
where exists (
    select *
    from dupe db
    where db.record_id = da.record_id
      and db.id != da.id
);

Upvotes: 1

Parfait

Reputation: 107642

Consider reading the raw, unqueried values of the Postgres json column and using pandas' json_normalize() to flatten them into a dataframe. From there, use pandas' drop_duplicates().

To demonstrate, the code below parses your first json document into a three-row dataframe, one row per corresponding Identifiers record:

import json
import pandas as pd

json_str = '''
{
        "Firstname": "Bobb",
        "Lastname": "Smith",
        "Identifiers": [
            {
                "Content": "123",
                "RecordID": "123",
                "SystemID": "Test",
                "LastUpdated": "2017-09-12T02:23:30.817Z"
            },
            {
                "Content": "abc",
                "RecordID": "abc",
                "SystemID": "Test",
                "LastUpdated": "2017-09-13T10:10:21.598Z"
            },
            {
                "Content": "def",
                "RecordID": "def",
                "SystemID": "Test",
                "LastUpdated": "2017-09-13T10:10:21.598Z"
            }
        ]
}
'''

data = json.loads(json_str)    
df = pd.json_normalize(data, 'Identifiers', ['Firstname', 'Lastname'])

print(df)
#   Content RecordID SystemID               LastUpdated Firstname Lastname
# 0     123      123     Test  2017-09-12T02:23:30.817Z      Bobb    Smith
# 1     abc      abc     Test  2017-09-13T10:10:21.598Z      Bobb    Smith
# 2     def      def     Test  2017-09-13T10:10:21.598Z      Bobb    Smith

For your database, consider connecting with a DB-API driver such as psycopg2, or with SQLAlchemy, and parsing each json document as a string. Admittedly, there may be other ways to handle json, as seen in the psycopg2 docs, but the code below receives the data as text and parses it on the Python side:

import psycopg2
conn = psycopg2.connect("dbname=test user=postgres")

cur = conn.cursor()    
cur.execute("SELECT json_document::text FROM staging;")

df = pd.json_normalize([json.loads(row[0]) for row in cur.fetchall()],
                       'Identifiers', ['Firstname', 'Lastname'])

df = df.drop_duplicates(['RecordID'])

cur.close()
conn.close()
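
Note that drop_duplicates() keeps only one row per RecordID. If the goal is instead to keep every row whose RecordID occurs more than once (the equivalent of your original count(*) > 1 filter), one option is duplicated() with keep=False. A sketch, continuing from the df built above:

# Flag every row whose RecordID appears more than once and keep them all,
# mirroring the original query's count(*) > 1 filter.
dupes = df[df.duplicated('RecordID', keep=False)]
print(dupes)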

Upvotes: 1
