Reputation: 1053
I have dataset consists of categorical and numerical columns. For instance: salary dataset
columns: ['job', 'country_origin', 'age', 'salary', 'degree','marital_status']
four categorical columns and two numerical columns and I want to use three aggregate functions:
cat_col = ['job', 'country_origin','degree','marital_status']
num_col = [ 'age', 'salary']
aggregate_function = ['avg','max','sum']
Currently, I have my Python code that using raw query, while my objective is to get the group-by query results from all combinations from lists above:
my query: "SELECT cat_col[0], aggregate_function[0](num_col[0]) from DB where marital_status = 'married' groub by cat_col[0]"
So queries are:
q1 = select job, avg(age) from DB where marietal_status='married' groub by job
q2 = select job, avg(salary) from DB where marietal_status='married' groub by job
etc
I used for loop
to get the result from all combinations.
My problem is, I want to change that query to Pandas query. I've spent a couple of hours but could not solve it.
Pandas has a different way to querying data.
Sample dataframe:
df2 = pd.DataFrame(np.array([['programmer', 'US', 28,4000, 'master','unmarried'],
['data scientist', 'UK', 30,5000, 'PhD','unmarried'],
['manager', 'US', 48,9000, 'master','married']]),
columns=[['job', 'country_origin', 'age', 'salary', 'degree','marital_status']])
Upvotes: 2
Views: 16084
Reputation: 135
First import the libaries
import pandas as pd
Build the sample dataframe
df = pd.DataFrame( {
"job" : ["programmer","data scientist","manager"] ,
"country_origin" : ["US","UK","US"],
"age": [28,30,48],
"salary": [4000,5000,9000],
"degree": ["master","PhD","master"],
"marital_status": ["unmarried","unmarried","married"]} )
apply the where clause, save as a new dataframe (not necessary, but easier to read), you can of course use the filtered df inside the groupby
married=df[df['marital_status']=='married']
q1 = select job, avg(age) from DB where marietal_status='married' group by job
married.groupby('job').agg( {"age":"mean"} )
or
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean"} )
age
job
manager 48
q2 = select job, avg(salary) from DB where marietal_status='married' group by job
married.groupby('job').agg( {"salary":"mean"} )
salary
job
manager 9000
You can flatten the table by resetting the index
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean"} ).reset_index()
job age
0 manager 48
output the two stats together:
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean","salary":"mean"} ).reset_index()
job age salary
0 manager 48 9000
Upvotes: 2
Reputation: 141
After you create your dataframe (df), the following command builds your desired table.
df.groupby(['job', 'country_origin','degree'])[['age', 'salary']].agg([np.mean,max,sum])
Here is a complete example:
import numpy as np
import pandas as pd
df=pd.DataFrame()
df['job']=['tech','coder','admin','admin','admin','tech']
df['country_origin']=['japan','japan','US','US','India','India']
df['degree']=['cert','bs','bs','ms','bs','cert']
df['age']=[22,23,30,35,40,28]
df['salary']=[30,50,60,90,65,40]
df.groupby(['job', 'country_origin','degree'])[['age', 'salary']].agg([np.mean,max,sum])
Upvotes: 1