Reputation: 7265
I want to merge an excel file with sql in pandas, here's my code
import pandas as pd
import pymysql
from sqlalchemy import create_engine
data1 = pd.read_excel('data.xlsx')
engine = create_engine('...cloudprovider.com/...')
data2 = pd.read_sql_query("select id, column3, column4 from customer", engine)
data = data1.merge(data2, on='id', how='left')
It works, just to make it clearer
If input data1.columns
the output Index(['id', 'column1', 'column2'], dtype='object')
If input data2.columns
the output Index(['id', 'column3', 'column4'], dtype='object')
If input data.columns
the output Index(['id', 'column1', 'column2', 'column3', 'column4'], dtype='object')
Since the data2
getting bigger, I can't query entirely, so I want to query data2
with id
that exist on data1
. How suppose I do this?
Upvotes: 0
Views: 240
Reputation: 52937
You could leverage the fact that SQLAlchemy is a great query builder. Either reflect the customer table, or build the metadata by hand:
from sqlalchemy import MetaData, select
metadata = MetaData()
metadata.reflect(engine, only=['customer'])
customer = metadata.tables['customer']
and build your query, letting SQLAlchemy worry about proper usage of placeholders, data conversion etc. You're looking for customer rows where id is in the set of ids from data1, achieved in SQL with the IN operator:
query = select([customer.c.id,
customer.c.column3,
customer.c.column4]).\
where(customer.c.id.in_(data1['id']))
data2 = pd.read_sql_query(query, engine)
If you wish to keep on using SQL strings manually, you could build a parameterized query as such:
placeholders = ','.join(['%s'] * data1['id'].count())
# Note that you're not formatting the actual values here, but placeholders
query = f"SELECT id, column3, column4 FROM customer WHERE id IN ({placeholders})"
data2 = pd.read_sql_query(query, engine, params=data1['id'])
In general it is beneficial to learn to use placeholders instead of mixing SQL and values by formatting/concatenating strings, as it may expose you to SQL injection, if handling user generated data. Usually you'd write required placeholders in the query string directly, but some string building is required, if you have a variable amount of parameters1.
1: Some DB-API drivers, such as psycopg2, allow passing tuples and lists as scalar values and know how to construct suitable SQL.
Upvotes: 1
Reputation: 1406
Since you are looking into a condition as WHERE IN [Some_List]. This should work for you
id_list = data1['id'].tolist()
your_query = "select id, column3, column4 from customer where id in "+tuple(id_list)
data2 = pd.read_sql_query(your_query , engine)
Hope it works.
Upvotes: 0