IRSAgent

Reputation: 339

Using Pandas Dataframe within a SQL Join

I'm trying to perform a SQL join on the contents of a dataframe with an external table I have in a Postgres database.

This is what the Dataframe looks like:

>>> df
   name  author  count
0  a     b       10
1  c     d       5
2  e     f       2

I need to join it with a Postgres table that looks like this:

TABLE: blog
title   author    url    
a       b         w.com
b       b         x.com
e       g         y.com

This is what I'm attempting to do, but this doesn't appear to be the right syntax for the query:

>>> sql_join = r"""select b.*, frame.*  from ({0}) frame
        join blog b
        on frame.name = b.title
        where frame.owner = b.owner 
        order by frame.count desc
        limit 30;""".format(df)

>>> res = pd.read_sql(sql_join, connection)

I'm not sure how I can use the values in the dataframe within the SQL query. Can someone point me in the right direction? Thanks!

Edit: As per my use case, I'm not able to convert the blog table into a dataframe given memory and performance constraints.

Upvotes: 7

Views: 20677

Answers (4)

roboes

Reputation: 401

I had a similar issue using the SQLAlchemy package with an SQLite database connection. This answer did not work for me. Adapting it a bit, I was able to achieve the desired result:

# Import packages
import pandas as pd
from sqlalchemy import create_engine

# Create a database connection
connection = create_engine('sqlite:///example.db').connect()

# Create sample DataFrame
df = pd.DataFrame(data=[[1, 'A'], [2, 'B'], [3, 'C']], columns=['column1', 'column2'])

# Create a string of value tuples from the DataFrame rows
data = str([tuple(x) for x in df.values]).replace('[', '').replace(']', '')

# Build an inline table from the DataFrame (SQLite names VALUES columns column1, column2, ...)
inner_table = """
SELECT [column1] AS column_a,
    CAST([column2] AS TEXT) AS column_b
FROM (VALUES {})
""".format(data)

# Join the database table (my_table here is a placeholder name) against the inline table
query = """
SELECT my_table.*
FROM my_table
INNER JOIN ({}) AS inner_table
ON my_table.column_a = inner_table.column_a
""".format(inner_table)

# Read the joined result into a DataFrame
pd.read_sql_query(sql=query, con=connection)

Upvotes: 0

Benfunks

Reputation: 1

I've had similar problems. I found a workaround that lets me join data across two different servers where I only have read-only rights: use SQLAlchemy to insert the pandas dataframe into a temporary table, and then join.

import sqlalchemy as sa
import pandas as pd

# Assumes an existing SQLAlchemy `engine` and `session` for the target server
metadata = sa.MetaData()
sql_of_df = sa.Table(
    "##df",
    metadata,
    sa.Column("name", sa.String(255), primary_key=True),  # pick string lengths to suit your data
    sa.Column("author", sa.String(255), nullable=False),
    sa.Column("count", sa.Integer),
)
metadata.create_all(engine)
dataframe_dict = df.to_dict(orient='records')
insert_statement = sql_of_df.insert().values(
    {
        "name": sa.bindparam("name"),
        "author": sa.bindparam("author"),
        "count": sa.bindparam("count"),
    }
)
session.execute(insert_statement, dataframe_dict)

statement = sa.text("SELECT * FROM blog INNER JOIN ##df ON blog.title = ##df.name")
session.execute(statement)

Upvotes: 0

IRSAgent

Reputation: 339

I managed to do this without converting the dataframe into a temp table and without reading the blog table into a dataframe.

For anyone else facing the same issue, this is achieved using a virtual table of sorts.

This is what my final SQL query looks like:

>>> inner_string = "VALUES ('a','b',10), ('c','d',5), ('e','f',2)"

>>> sql_join = r"""SELECT * FROM blog
        JOIN ({0}) AS frame(title, owner, count)
        ON blog.title = frame.title
        WHERE blog.owner = frame.owner 
        ORDER BY frame.count DESC
        LIMIT 30;""".format(inner_string)

>>> res = pd.read_sql(sql_join, connection)

You can use string manipulation to convert all rows in the dataframe into one large string similar to inner_string.
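As a minimal sketch of that string manipulation (assuming text and integer columns as in the example above; for untrusted data you would want proper escaping or parameter binding rather than naive quoting):

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'c', 'e'],
                   'author': ['b', 'd', 'f'],
                   'count': [10, 5, 2]})

# Build one ('name','author',count) tuple per row, then join them into a VALUES clause
rows = ", ".join("('{0}','{1}',{2})".format(name, author, count)
                 for name, author, count in df.itertuples(index=False))
inner_string = "VALUES " + rows
print(inner_string)
# VALUES ('a','b',10), ('c','d',5), ('e','f',2)
```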

Upvotes: 10

Mayank Porwal

Reputation: 34046

You should create another dataframe from the Postgres table and then join both dataframes.

You can use read_sql to create a df from the table:

import psycopg2  ## Python connector library for Postgres
import pandas as pd

conn = psycopg2.connect(...) ## Put your DB credentials here
blog_df = pd.read_sql('SELECT * FROM blog', con=conn)
## This reads the `blog` table's data into blog_df
## (a bare table name instead of a query requires a SQLAlchemy connection)

It should look like this:

In [258]: blog_df
Out[258]: 
  title author    url
0     a      b  w.com
1     b      b  x.com
2     e      g  y.com

Now, you can join df and blog_df using merge like below:

In [261]: pd.merge(df, blog_df, left_on='name', right_on='title')
Out[261]: 
  name author_x  count title author_y    url
0    a        b     10     a        b  w.com
1    e        f      2     e        g  y.com

You will get a result like the one above, which you can clean up further.

Let me know if this helps.
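As a sketch of that cleanup (the author_x/author_y names come from pandas' default merge suffixes; the filter mirrors the WHERE clause in the original question):

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'c', 'e'],
                   'author': ['b', 'd', 'f'],
                   'count': [10, 5, 2]})
blog_df = pd.DataFrame({'title': ['a', 'b', 'e'],
                        'author': ['b', 'b', 'g'],
                        'url': ['w.com', 'x.com', 'y.com']})

merged = pd.merge(df, blog_df, left_on='name', right_on='title')

# Keep only rows where the authors agree, then drop the duplicated join columns
cleaned = merged[merged['author_x'] == merged['author_y']]
cleaned = cleaned.drop(columns=['title', 'author_y']).rename(columns={'author_x': 'author'})
print(cleaned)
```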

Upvotes: 2
