RCarmody

Reputation: 720

Create DataFrame with Multiple Lists

I have a data frame (data) with multiple data types (outliers were previously removed and marked with the string Outlier). I am looking to summarize this data into a new data frame (analysis), but I'm running into issues with the data types.

The issue I am facing is that some of the columns are descriptors (categories, names, countries, etc.) and are not included in the numerical lists (mean, med, sd). This creates a mismatch in the number of rows in the lists (len(title) = 64, len(mean) = 61).

I'd like the data frame to match up to all 64 columns, with the descriptors marked as NaN for numerical fields such as mean (because I know you can't take the mean of ['Blue', 'Red', 'Yellow']).

Sample Data:

ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|4.13617021276596|4.30932203389831


Code:

title = list(data.keys())     # all 64 column names
n = list(data.count())        # counts cover every column
mean = list(data.mean())      # numeric columns only -> 61 entries
med = list(data.median())     # numeric columns only
sd = list(data.std())         # numeric columns only

analysis = pd.DataFrame({'Mean':mean,'Median':med,'SD':sd})
print(analysis)
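To see why the lists end up with different lengths, the sample rows above reproduce the mismatch in miniature (a sketch; `numeric_only=True` is passed explicitly because newer pandas versions no longer silently skip non-numeric columns):

```python
import io
import pandas as pd

raw = """ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|4.13617021276596|4.30932203389831"""
data = pd.read_csv(io.StringIO(raw), sep='|')

title = list(data.keys())                  # all 9 column names
mean = list(data.mean(numeric_only=True))  # only the 5 numeric columns
print(len(title), len(mean))               # 9 vs 5 -- same shape of mismatch as 64 vs 61
```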

Current Output:


Desired Output: additional rows should be shown with the NaN value where no numerical values exist (i.e. for a category or country column). This would increase the number of rows to 64 rather than 61 and allow the additional columns (Title, Count, etc.) to be added.
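One way to get that alignment (a sketch, shown with the three sample rows; the full frame would give 64 rows) is to keep the pandas Series returned by mean/median/std instead of converting them to plain lists. The DataFrame constructor then aligns them on their index, and reindexing on data.columns restores the descriptor columns as NaN rows:

```python
import io
import pandas as pd

raw = """ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|4.13617021276596|4.30932203389831"""
data = pd.read_csv(io.StringIO(raw), sep='|')

# Series (not lists) keep the column names as their index, so the
# constructor can line them up; reindex brings back every column.
analysis = pd.DataFrame({
    'Count': data.count(),
    'Mean': data.mean(numeric_only=True),
    'Median': data.median(numeric_only=True),
    'SD': data.std(numeric_only=True),
}).reindex(data.columns)
print(analysis)
```

Descriptor columns such as ORG keep their Count but show NaN for Mean, Median and SD.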


Upvotes: 0

Views: 1806

Answers (2)

Jondiedoop

Reputation: 3353

If I understand correctly, you can combine agg with the DataFrame constructor, where you specify which columns you want to keep. Finally, transpose to switch rows and columns:

pd.DataFrame(df.agg(['mean', 'median', 'std']), columns=df.columns).transpose()
# Output
#                    mean       median         std
#ORG                     NaN          NaN         NaN
#PROGRAM                 NaN          NaN         NaN
#YEAR            2017.666667  2018.000000    0.577350
#INDUSTRY                NaN          NaN         NaN
#Responses        182.333333   236.000000  102.627157
## of Questions    44.333333    44.000000    1.527525
#New Zone                NaN          NaN         NaN
#Q1                 4.502685     4.613821    0.325502
#Q2                 4.562344     4.668033    0.220111

EDIT: You can also write a custom function with which to aggregate the columns, e.g.:

def more_than_4(col):
    '''Count nr of entries larger than 4'''
    return (col > 4).sum()

df.agg(['mean', more_than_4])

#Output
#                    YEAR   Responses  ...        Q1        Q2
#mean         2017.666667  182.333333  ...  4.502685  4.562344
#more_than_4     3.000000    3.000000  ...  3.000000  3.000000

Upvotes: 1

WGS

Reputation: 14179

This is actually quite simple.

import pandas as pd
df = pd.read_csv('database.txt', sep='|') # Read data.
print(df.describe(include='all').T) # Just force .describe() to show all columns then transpose.

Should output:

               count unique       top freq     mean       std      min      25%      50%      75%      max
ORG                3      3        UK    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
PROGRAM            3      3       IRE    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
YEAR               3    NaN       NaN  NaN  2017.67   0.57735     2017   2017.5     2018     2018     2018
INDUSTRY           3      3  PRD - LF    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
Responses          3    NaN       NaN  NaN  182.333   102.627       64      150      236    241.5      247
# of Questions     3    NaN       NaN  NaN  44.3333   1.52753       43     43.5       44       45       46
New Zone           3      3       Low    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
Q1                 3    NaN       NaN  NaN  4.50269  0.325502  4.13617    4.375  4.61382  4.68594  4.75806
Q2                 3    NaN       NaN  NaN  4.56234  0.220111  4.30932  4.48868  4.66803  4.68886  4.70968

The median is the 50% column.
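If only the original four statistics are wanted, the relevant columns can be pulled out of the describe output and '50%' renamed (a sketch, assuming the same pipe-separated sample data):

```python
import io
import pandas as pd

raw = """ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|4.13617021276596|4.30932203389831"""
df = pd.read_csv(io.StringIO(raw), sep='|')

# Select just count/mean/median/std and give '50%' a friendlier name.
summary = (df.describe(include='all').T
             .loc[:, ['count', 'mean', '50%', 'std']]
             .rename(columns={'50%': 'Median'}))
print(summary)
```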

EDIT:

If there's a risk of string values being inserted into numeric columns and messing up the values/dtypes, just coerce those columns to numeric.

import pandas as pd
df = pd.read_csv('database.txt', sep='|')
main_cols = ['Q1', 'Q2']
df[main_cols] = df[main_cols].apply(pd.to_numeric, errors='coerce')
print(df.describe(include='all').T)

Tested with the following data (note the BLAHHHHHHH value in the last row):

ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|BLAHHHHHHH|4.30932203389831

Output is:

               count unique     top freq     mean       std      min      25%      50%      75%      max
ORG                3      3     CAN    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
PROGRAM            3      3     IRE    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
YEAR               3    NaN     NaN  NaN  2017.67   0.57735     2017   2017.5     2018     2018     2018
INDUSTRY           3      3  FS - B    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
Responses          3    NaN     NaN  NaN  182.333   102.627       64      150      236    241.5      247
# of Questions     3    NaN     NaN  NaN  44.3333   1.52753       43     43.5       44       45       46
New Zone           3      3     Low    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
Q1                 2    NaN     NaN  NaN  4.68594  0.101995  4.61382  4.64988  4.68594    4.722  4.75806
Q2                 3    NaN     NaN  NaN  4.56234  0.220111  4.30932  4.48868  4.66803  4.68886  4.70968

Upvotes: 1
