How do I make inner join with external joiner on sql in python in efficient way

Question

I want to merge an excel file with sql in pandas, here's my code

import pandas as pd
import pymysql
from sqlalchemy import create_engine
data1 = pd.read_excel('data.xlsx')
engine = create_engine('...cloudprovider.com/...')
data2 = pd.read_sql_query("select id, column3, column4 from customer", engine)
data = data1.merge(data2, on='id', how='left')

It works, just to make it clearer

If input data1.columns the output Index(['id', 'column1', 'column2'], dtype='object')

If input data2.columns the output Index(['id', 'column3', 'column4'], dtype='object')

If input data.columns the output Index(['id', 'column1', 'column2', 'column3', 'column4'], dtype='object')

Since the data2 getting bigger, I can't query entirely, so I want to query data2 with id that exist on data1. How suppose I do this?

Ilja Everil&#228; · Accepted Answer

You could leverage the fact that SQLAlchemy is a great query builder. Either reflect the customer table, or build the metadata by hand:

from sqlalchemy import MetaData, select

metadata = MetaData()
metadata.reflect(engine, only=['customer'])
customer = metadata.tables['customer']

and build your query, letting SQLAlchemy worry about proper usage of placeholders, data conversion etc. You're looking for customer rows where id is in the set of ids from data1, achieved in SQL with the IN operator:

query = select([customer.c.id,
                customer.c.column3,
                customer.c.column4]).\
    where(customer.c.id.in_(data1['id']))

data2 = pd.read_sql_query(query, engine)

If you wish to keep on using SQL strings manually, you could build a parameterized query as such:

placeholders = ','.join(['%s'] * data1['id'].count())
# Note that you're not formatting the actual values here, but placeholders
query = f"SELECT id, column3, column4 FROM customer WHERE id IN ({placeholders})"
data2 = pd.read_sql_query(query, engine, params=data1['id'])

In general it is beneficial to learn to use placeholders instead of mixing SQL and values by formatting/concatenating strings, as it may expose you to SQL injection, if handling user generated data. Usually you'd write required placeholders in the query string directly, but some string building is required, if you have a variable amount of parameters¹.

¹: Some DB-API drivers, such as psycopg2, allow passing tuples and lists as scalar values and know how to construct suitable SQL.

How do I make inner join with external joiner on sql in python in efficient way

Answers (2)

Related Questions