How to replace for loop in pandas Dataframe?

Question

I have a function that takes a dataframe with data about articles on life expectancy in different regions and countries. I want to compute each region's share of all articles, and also, within each region, the shares of articles about males, females, and both sexes. My question is: how can I replace the for loop in calc_proportion so that it still returns a small dataframe? The function takes all the unique regions in the dataframe and computes the proportions for each of them.

I want to get this kind of dataframe from the function calc_proportion:

import pandas as pd

def calc_proportion(df):
    proportions = pd.DataFrame(columns=['Region', 'Proportion_of_all_articles', 'Proportion_male_articles', 'Proportion_female_articles', 'Proportion_bs_articles'])
    Regions = df.Region.unique()
    for region in Regions:
        a = f"{df.loc[df['Region'] == region].shape[0] / df.shape[0] : .0%}"
        b = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Male')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        c = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Female')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        d = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Both sexes')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        proportions.loc[len(proportions)] = [region, a, b, c, d]
    return proportions

calc_proportion(df)

So I want to get the same small dataframe of proportions as output, without using a for loop in the function.

ouroboros1 · Accepted Answer

Minimal reproducible example

import pandas as pd
import numpy as np

np.random.seed(0) # for reproducibility
regions = ['Africa', 'Americas', 'Eastern Mediterranean', 'Europe', 
           'South_East Asia']
sexes = ['Male', 'Female', 'Both sexes']

data = {'Region': np.random.choice(regions, 15),
        'Sex': np.random.choice(sexes, 15)}

df = pd.DataFrame(data)

df

                   Region         Sex
0         South_East Asia      Female
1                  Africa      Female
2                  Europe      Female
3                  Europe      Female
4                  Europe        Male
5                Americas      Female
6                  Europe        Male
7   Eastern Mediterranean        Male
8         South_East Asia      Female
9                  Africa  Both sexes
10                 Africa        Male
11        South_East Asia  Both sexes
12  Eastern Mediterranean        Male
13               Americas      Female
14                 Africa      Female

Here's one approach:

Use df.groupby on "Region", select "Sex", and apply groupby.value_counts with the normalize parameter set to True to get the distribution of sexes per region.
Next, use Series.unstack to pivot the second index level ("Sex") into columns.
For "proportion of all articles" we need the same value_counts applied directly to df["Region"] (Series.value_counts). We use df.join to join the two results.
The rest is cosmetic:
- Add df.fillna to fill NaN values with 0.
- Add df.rename to change the column names.
- Get the columns in the desired order with df.loc, and reset the index with df.reset_index.
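
To make the intermediate steps easier to follow, here is a minimal sketch on a small hand-made frame. The `toy` data and the names `per_region`, `wide`, and `share` are illustrative assumptions, not part of the question's data:

```python
import pandas as pd

# Hypothetical toy data, only for illustrating the intermediate steps
toy = pd.DataFrame({
    'Region': ['Africa', 'Africa', 'Europe', 'Europe', 'Europe', 'Europe',
               'Americas'],
    'Sex': ['Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female'],
})

# Step 1: distribution of "Sex" within each region
# (a Series with a two-level index: Region, Sex)
per_region = toy.groupby('Region')['Sex'].value_counts(normalize=True)

# Step 2: pivot the "Sex" level into columns; combinations that never
# occur (e.g. Americas / Male) come out as NaN, hence the fillna(0)
wide = per_region.unstack('Sex').fillna(0)

# Step 3: share of each region among all rows
share = toy['Region'].value_counts(normalize=True)

print(wide)
print(share)
```

Printing `per_region` before the unstack step is a quick way to see why the pivot is needed: the sexes are still stacked in the index at that point.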

Code

# dict for renaming col names at end
cols_rename = {'Region': 'Proportion_of_all_articles',
               'Male': 'Proportion_male_articles',
               'Female': 'Proportion_female_articles',
               'Both sexes': 'Proportion_bs_articles'}

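out = (df.groupby('Region')['Sex']
       .value_counts(normalize=True)
       .unstack('Sex')
       .join(
           # name the Series explicitly: pandas >= 2.0 calls it
           # 'proportion', older versions call it 'Region'
           df['Region'].value_counts(normalize=True).rename('Region')
           )
       .fillna(0)
       .rename(columns=cols_rename)
       .loc[:, list(cols_rename.values())]
       .reset_index(drop=False)
       )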

Result

out

                  Region  Proportion_of_all_articles  \
0                 Africa                    0.266667   
1               Americas                    0.133333   
2  Eastern Mediterranean                    0.133333   
3                 Europe                    0.266667   
4        South_East Asia                    0.200000   

   Proportion_male_articles  Proportion_female_articles  \
0                      0.25                    0.500000   
1                      0.00                    1.000000   
2                      1.00                    0.000000   
3                      0.50                    0.500000   
4                      0.00                    0.666667   

   Proportion_bs_articles  
0                0.250000  
1                0.000000  
2                0.000000  
3                0.000000  
4                0.333333
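
As a cross-check, the same per-region proportions can also be obtained with pd.crosstab. This is an alternative to the chain above rather than part of the original approach, and the `toy` frame and the column name 'All' are assumptions made for the sketch:

```python
import pandas as pd

# Hypothetical toy data standing in for the question's df
toy = pd.DataFrame({
    'Region': ['Africa', 'Africa', 'Europe', 'Europe', 'Europe', 'Europe',
               'Americas'],
    'Sex': ['Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female'],
})

# Row-normalised cross-tabulation: each row (region) sums to 1,
# and combinations that never occur are already filled with 0
by_sex = pd.crosstab(toy['Region'], toy['Sex'], normalize='index')

# Share of each region among all rows, aligned on the region index
by_sex['All'] = toy['Region'].value_counts(normalize=True)

print(by_sex)
```

With crosstab there is no need for unstack or fillna, at the cost of a separate step for the overall share.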

Formatted result

Seeing that you are working in Jupyter Notebook, I'd suggest using df.style.format to print the result with the floats as percentages:

out.style.format({
    col: lambda x: "{: .0f}%".format(x*100) for col in out.columns if 'Proportion' in col
})
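
If you are not in Jupyter, or you want the percentages stored as strings in the dataframe itself (as the function in the question does), one option is to map a format string over the proportion columns. The `demo` frame below is a hypothetical stand-in for `out`:

```python
import pandas as pd

# Hypothetical stand-in for the `out` frame built above
demo = pd.DataFrame({
    'Region': ['Africa', 'Europe'],
    'Proportion_of_all_articles': [0.25, 0.75],
    'Proportion_male_articles': [0.5, 1.0],
})

prop_cols = [col for col in demo.columns if 'Proportion' in col]

# Replace each float column with percentage strings, e.g. 0.25 -> '25%'
formatted = demo.assign(
    **{col: demo[col].map('{:.0%}'.format) for col in prop_cols}
)

print(formatted)
```

Note that the formatted columns are strings, so do any further calculations on the numeric frame first.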
