Reputation: 18830
I have a SQL query that returns a data set in the following format:
user_id, type_id, avg
1, 3, 2.5
1, 2, 3.0
1, 5, 4.6
1, 11, 3.4
2, 2, 4.5
2, 3, 3.0
2, 11, 3.1
The data above comes from the following query, which is executed against a very large table:
select u.user_id, t.type_id, sum(u.preference)/count(u.preference) as avg
from user_preference u, item_type_pairs t
where t.item_id = u.item_id
group by u.user_id, t.type_id;
The query takes 10 minutes and returns 2+ million records. My end goal is to put this in a data frame where rows are user_id, columns represent type_id, and each cell holds the avg value for that user/type pair:
    type_id_2, type_id_3, ...
u1|    3.0       2.5
u2|    4.5       3.0
What would be the best way to go about this? I am still figuring it out. Should I be reading row by row and somehow populating the data frame?
Upvotes: 0
Views: 610
Reputation: 14126
I'm going to assume that you are able to create a MySQL connection object, using something like:
import MySQLdb as mdb
con = mdb.connect('localhost', 'testuser', 'test623', 'testdb')
Then, getting your data into python is as simple as:
with con:
    cur = con.cursor(mdb.cursors.DictCursor)
    # Note the trailing spaces and the "as avg" alias: without the spaces the
    # concatenated fragments would run together into invalid SQL, and without
    # the alias the dict key would be the whole expression.
    cur.execute(
        "select u.user_id, t.type_id, "
        "sum(u.preference)/count(u.preference) as avg "
        "from user_preference u, item_type_pairs t "
        "where t.item_id = u.item_id "
        "group by u.user_id, t.type_id;"
    )
    rows = cur.fetchall()
At this point rows will look something like:
[{'user_id': 1, 'type_id': 3, 'avg': 2.5},
 {'user_id': 1, 'type_id': 2, 'avg': 3.0},
 ...]
From here, creating a pandas DataFrame is extremely simple:
import pandas as pd
import numpy as np
my_df = pd.DataFrame(rows)
Then, you can use the pivot_table function to transform it into your desired output:
final_df = pd.pivot_table(
    my_df,
    index='user_id',
    columns='type_id',
    values='avg',
    aggfunc=np.mean
)
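To see the reshaping end to end, here is a self-contained sketch using the sample values from the question (the `rows` list stands in for the cursor output):

```python
import pandas as pd
import numpy as np

# Stand-in for the rows fetched from the database.
rows = [
    {'user_id': 1, 'type_id': 3, 'avg': 2.5},
    {'user_id': 1, 'type_id': 2, 'avg': 3.0},
    {'user_id': 2, 'type_id': 2, 'avg': 4.5},
    {'user_id': 2, 'type_id': 3, 'avg': 3.0},
]
my_df = pd.DataFrame(rows)

# Pivot to the wide shape: one row per user_id, one column per type_id.
final_df = pd.pivot_table(
    my_df,
    index='user_id',
    columns='type_id',
    values='avg',
    aggfunc=np.mean,
)
print(final_df)
```

Since each (user_id, type_id) pair appears at most once, aggfunc only matters if the query ever produces duplicates; missing pairs become NaN in the wide frame.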
Upvotes: 1