add-semi-colons
add-semi-colons

Reputation: 18830

How to transfer SQL Group by results to a Pandas dataframe

I have sql query that return a data set in following format

user_id, type_id, avg
1, 3, 2.5
1, 2, 3.0
1, 5, 4.6
1, 11, 3.4
2, 2, 4.5
2, 3, 3.0
2, 11, 3.1

data above comes from following query, and it get executed is a very large table.

select u.user_id, t.type_id, sum(u.preference)/count(u.preference) 
from user_preference u, item_type_pairs t 
where t.item_id = u.item_id group by u.user_id, t.type_id;

Query takes 10min and returns 2 plus million records. My end goal is to put this in data frame where rows are user_id and columns representing type_id and each cell representing the avg value for an item by type_id.

   type_id_1, type_id_2, type_id_3
u1|             3.0        2.5
u2|             4.5        3.0 

What would be the best way to go about on this. I am also still figuring out? Should I be reading row by row and somehow populate the data frame?

Upvotes: 0

Views: 610

Answers (1)

Wilduck
Wilduck

Reputation: 14126

I'm going to assume that you are able to create a MySql connection object, using something like:

import MySQLdb as mdb

con = mdb.connect('localhost', 'testuser', 'test623', 'testdb')

Then, getting your data into python is as simple as:

with con:
    cur = con.cursor(mdb.cursors.DictCursor)
    cur.execute(
        "select u.user_id, t.type_id, sum(u.preference)/count(u.preference)"
        "from user_preference u, item_type_pairs t"
        "where t.item_id = u.item_id group by u.user_id, t.type_id;"
    )
    rows = cur.fetchall()

At this point rows will look something like:

[{'user_id': 1, 'type_id': 2, 'avg': 2.5},
 {'user_id': 1, 'type_id': 2, 'avg': 3.0},
 ...]

From this step, creating a pandas dataframe from this data is extremely simple:

import pandas as pd
import numpy as np

my_df = pd.DataFrame(rows)

Then, you can use the pivot_table function to transform it into your desired output:

final_df = pd.pivot_table(
    df,
    index='user_id',
    columns='type_id',
    values='avg',
    agg_func=np.average
)

Upvotes: 1

Related Questions