user46543
user46543

Reputation: 1053

Group by with where query on Pandas Python

I have dataset consists of categorical and numerical columns. For instance: salary dataset

columns: ['job', 'country_origin', 'age', 'salary', 'degree','marital_status']

four categorical columns and two numerical columns and I want to use three aggregate functions:

cat_col = ['job', 'country_origin','degree','marital_status']
num_col = [ 'age', 'salary']
aggregate_function = ['avg','max','sum']

Currently, I have my Python code that using raw query, while my objective is to get the group-by query results from all combinations from lists above:

my query: "SELECT cat_col[0], aggregate_function[0](num_col[0]) from DB where marital_status = 'married' groub by cat_col[0]"

So queries are:

q1 = select job, avg(age) from DB where marietal_status='married' groub by job
q2 = select job, avg(salary) from DB where marietal_status='married' groub by job

etc

I used for loop to get the result from all combinations.

My problem is, I want to change that query to Pandas query. I've spent a couple of hours but could not solve it.

Pandas has a different way to querying data.

Sample dataframe:

df2 = pd.DataFrame(np.array([['programmer', 'US', 28,4000, 'master','unmarried'], 
                             ['data scientist', 'UK', 30,5000, 'PhD','unmarried'],
                             ['manager', 'US', 48,9000, 'master','married']]),
                   columns=[['job', 'country_origin', 'age', 'salary', 'degree','marital_status']])

Upvotes: 2

Views: 16084

Answers (2)

Rae
Rae

Reputation: 135

First import the libaries

import pandas as pd

Build the sample dataframe

df = pd.DataFrame( { 
    "job" : ["programmer","data scientist","manager"] , 
    "country_origin" : ["US","UK","US"],
    "age": [28,30,48],
    "salary": [4000,5000,9000],
    "degree": ["master","PhD","master"],
    "marital_status": ["unmarried","unmarried","married"]} )

apply the where clause, save as a new dataframe (not necessary, but easier to read), you can of course use the filtered df inside the groupby

married=df[df['marital_status']=='married']

q1 = select job, avg(age) from DB where marietal_status='married' group by job

married.groupby('job').agg( {"age":"mean"} )
or
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean"} )

         age
job
manager   48

q2 = select job, avg(salary) from DB where marietal_status='married' group by job

married.groupby('job').agg( {"salary":"mean"} )

         salary
job
manager    9000

You can flatten the table by resetting the index

df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean"} ).reset_index()

       job  age
0  manager   48

output the two stats together:

df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean","salary":"mean"} ).reset_index()

       job  age  salary
0  manager   48    9000

Upvotes: 2

sharder
sharder

Reputation: 141

After you create your dataframe (df), the following command builds your desired table.

df.groupby(['job', 'country_origin','degree'])[['age', 'salary']].agg([np.mean,max,sum])

Here is a complete example:

import numpy as np
import pandas as pd
df=pd.DataFrame()
df['job']=['tech','coder','admin','admin','admin','tech']
df['country_origin']=['japan','japan','US','US','India','India']
df['degree']=['cert','bs','bs','ms','bs','cert']
df['age']=[22,23,30,35,40,28]
df['salary']=[30,50,60,90,65,40]
df.groupby(['job', 'country_origin','degree'])[['age', 'salary']].agg([np.mean,max,sum])

Upvotes: 1

Related Questions