Reputation: 129
This is a two-part question. If you're checking this out, thanks for your time!
Is there a way to make my query faster?
I previously asked a question here, and was eventually able to solve the problem myself.
However, the query I devised to produce my desired results is VERY slow (25+ minutes) when run against my database, which contains 40,000+ records.
The query is serving its purpose, but I'm hoping one of you brilliant people can show me how to make it run at a more reasonable speed.
My query:
with dupe as (
    select
        json_document->'Firstname'->0->'Content' as first_name,
        json_document->'Lastname'->0->'Content' as last_name,
        identifiers->'RecordID' as record_id
    from (
        select *,
               jsonb_array_elements(json_document->'Identifiers') as identifiers
        from staging
    ) sub
    group by record_id, json_document
    order by last_name
)
select * from dupe da where (
    select count(*) from dupe db
    where db.record_id = da.record_id
) > 1;
Again, some sample data:
Row 1:
{
    "Firstname": "Bobb",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "123",
            "RecordID": "123",
            "SystemID": "Test",
            "LastUpdated": "2017-09-12T02:23:30.817Z"
        },
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        },
        {
            "Content": "def",
            "RecordID": "def",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        }
    ]
}
Row 2:
{
    "Firstname": "Bob",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:26.020Z"
        }
    ]
}
If I were to bring my query's results, or a portion of them, into a Python environment where they could be manipulated with Pandas, how could I iterate over the results of my query (or the sub-query) to achieve the same end result as my original query?
Is there an easier way, using Python, to iterate through my unnested JSON array in the same way that Postgres does?
For example, after performing this query:
select
    json_document->'Firstname'->0->'Content' as first_name,
    json_document->'Lastname'->0->'Content' as last_name,
    identifiers->'RecordID' as record_id
from (
    select *,
           jsonb_array_elements(json_document->'Identifiers') as identifiers
    from staging
) sub
order by last_name;
How, using Python/Pandas, can I take that query's results and perform something like:
da = datasets[query_results] # to equal my dupe da query
db = datasets[query_results] # to equal my dupe db query
Then perform the equivalent of
select * from dupe da where (
    select count(*) from dupe db
    where db.record_id = da.record_id
) > 1;
in Python?
I apologize if I do not provide enough information here. I am a Python novice. Any and all help is greatly appreciated! Thanks!!
Upvotes: 1
Views: 133
Reputation: 5962
Try the following, which eliminates your count(*) and instead uses exists.
with dupe as (
    select
        id,
        json_document->'Firstname'->0->'Content' as first_name,
        json_document->'Lastname'->0->'Content' as last_name,
        identifiers->'RecordID' as record_id
    from (
        select *,
               jsonb_array_elements(json_document->'Identifiers') as identifiers
        from staging
    ) sub
    group by id, record_id, json_document
    order by last_name
)
select * from dupe da
where exists (
    select *
    from dupe db
    where db.record_id = da.record_id
      and db.id != da.id
);
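exists can return as soon as it finds one matching row, whereas count(*) has to tally every match, and the db.id != da.id guard keeps a row from matching itself; with 40,000+ records that difference tends to add up.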
Upvotes: 1
Reputation: 107642
Consider reading the raw, unqueried values of the Postgres json column and using pandas' json_normalize() to flatten them into a dataframe. From there, use pandas' drop_duplicates().
To demonstrate, the snippet below parses your sample JSON document into a three-row dataframe, one row per Identifiers record:
import json
import pandas as pd
json_str = '''
{
    "Firstname": "Bobb",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "123",
            "RecordID": "123",
            "SystemID": "Test",
            "LastUpdated": "2017-09-12T02:23:30.817Z"
        },
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        },
        {
            "Content": "def",
            "RecordID": "def",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        }
    ]
}
'''
data = json.loads(json_str)
df = pd.io.json.json_normalize(data, 'Identifiers', ['Firstname','Lastname'])
print(df)
# Content LastUpdated RecordID SystemID Lastname Firstname
# 0 123 2017-09-12T02:23:30.817Z 123 Test Smith Bobb
# 1 abc 2017-09-13T10:10:21.598Z abc Test Smith Bobb
# 2 def 2017-09-13T10:10:21.598Z def Test Smith Bobb
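Note that drop_duplicates keeps only the first row per RecordID. If you instead want to surface every duplicated row, i.e. the equivalent of the count(*) > 1 filter in your original query, a minimal sketch uses pandas' duplicated() with keep=False (run it against the full dataframe built from your database below, since this three-row demo has no repeated RecordID):
# keep=False marks all members of each duplicate group, not just the
# second and later occurrences, mirroring the original count(*) > 1 filter
dupes = df[df.duplicated('RecordID', keep=False)]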
For your database, consider connecting with a DB-API such as psycopg2
or SQLAlchemy and parsing each json document from a string. Admittedly, there may be other ways to handle json, as seen in the psycopg2 docs, but the code below receives the data as text and parses it on the Python side:
import json
import pandas as pd
import psycopg2
conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()
cur.execute("SELECT json_document::text FROM staging;")
df = pd.io.json.json_normalize([json.loads(row[0]) for row in cur.fetchall()],
'Identifiers', ['Firstname','Lastname'])
df = df.drop_duplicates(['RecordID'])
cur.close()
conn.close()
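Alternatively, if you would rather let pandas handle the round trip, here is a minimal sketch with SQLAlchemy and pd.read_sql; the connection string is a placeholder, so substitute your own host, database, and credentials:
import json
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; adjust to your server and credentials.
engine = create_engine("postgresql+psycopg2://postgres@localhost/test")

# Fetch each jsonb document as text, then flatten exactly as above.
raw = pd.read_sql("SELECT json_document::text AS doc FROM staging", engine)
df = pd.io.json.json_normalize([json.loads(doc) for doc in raw['doc']],
                               'Identifiers', ['Firstname', 'Lastname'])
df = df.drop_duplicates(['RecordID'])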
Upvotes: 1