Reputation: 720
I have a data frame (data) that has multiple data types (outliers were previously removed and marked with the string: Outlier). I am looking to summarize this data into a new data frame (analysis), but am running into issues when it comes to data types.
The issue that I am facing is some of the columns are descriptors (categories, names, countries, etc) and are not pulled in the numerical lists (mean, med, sd). This creates a mismatch in the number of rows in the lists (len(title) = 64, len(mean) = 61).
I'd like for the data frame to match up to all 64, with those descriptors being marked as 'NaN' for numerical fields such as mean (because I know you cant take the mean of ['Blue','Red','Yellow'])
Sample Data:
ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|4.13617021276596|4.30932203389831
Code:
title = list(data.keys())
n = list(data.count())
mean = list(data.mean())
med = list(data.median())
sd = list(data.std())
analysis = pd.DataFrame({'Mean':mean,'Median':med,'SD':sd})
print(analysis)
Current Output:
Desired Output: Additional rows should be shown with the NaN value if no numerical values exist (i.e. if it's a category or country). This would increase the amount of rows to 64 rather than 61, and allow for the additional columns to be added (Title, count, etc)
Upvotes: 0
Views: 1806
Reputation: 3353
If I understand correctly, you could combine using agg
with the DataFrame
-constructor where you add which columns you want to add. Finally, transpose
to switch rows and columns:
pd.DataFrame(df.agg(['mean', 'median', 'std']), columns=df.columns).transpose()
# Output
# mean median std
#ORG NaN NaN NaN
#PROGRAM NaN NaN NaN
#YEAR 2017.666667 2018.000000 0.577350
#INDUSTRY NaN NaN NaN
#Responses 182.333333 236.000000 102.627157
## of Questions 44.333333 44.000000 1.527525
#New Zone NaN NaN NaN
#Q1 4.502685 4.613821 0.325502
#Q2 4.562344 4.668033 0.220111
EDIT You can also write custom function with which to aggregate the columns, e.g.
def more_than_4(col):
'''Count nr of entries larger than 4'''
return (col > 4).sum()
df.agg(['mean', more_than_4])
#Output
# YEAR Responses ... Q1 Q2
#mean 2017.666667 182.333333 ... 4.502685 4.562344
#more_than_4 3.000000 3.000000 ... 3.000000 3.000000
Upvotes: 1
Reputation: 14179
This is actually quite simple.
import pandas as pd
df = pd.read_csv('database.txt', sep='|') # Read data.
print(df.describe(include='all').T) # Just force .describe() to show all columns then transpose.
Should output:
count unique top freq mean std min 25% 50% 75% max
ORG 3 3 UK 1 NaN NaN NaN NaN NaN NaN NaN
PROGRAM 3 3 IRE 1 NaN NaN NaN NaN NaN NaN NaN
YEAR 3 NaN NaN NaN 2017.67 0.57735 2017 2017.5 2018 2018 2018
INDUSTRY 3 3 PRD - LF 1 NaN NaN NaN NaN NaN NaN NaN
Responses 3 NaN NaN NaN 182.333 102.627 64 150 236 241.5 247
# of Questions 3 NaN NaN NaN 44.3333 1.52753 43 43.5 44 45 46
New Zone 3 3 Low 1 NaN NaN NaN NaN NaN NaN NaN
Q1 3 NaN NaN NaN 4.50269 0.325502 4.13617 4.375 4.61382 4.68594 4.75806
Q2 3 NaN NaN NaN 4.56234 0.220111 4.30932 4.48868 4.66803 4.68886 4.70968
Median should be the 50% column.
EDIT:
If there's a risk of columns being inserted with strings and messing up the values/dtypes, just coerce them to numeric.
import pandas as pd
df = pd.read_csv('database.txt', sep='|')
main_cols = ['Q1', 'Q2']
df[main_cols] = df[main_cols].apply(pd.to_numeric, errors='coerce')
print(df.describe(include='all').T)
Tested with the following data (note the BLAHHHHHHH
value in last row):
ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|BLAHHHHHHH|4.30932203389831
Output is:
count unique top freq mean std min 25% 50% 75% max
ORG 3 3 CAN 1 NaN NaN NaN NaN NaN NaN NaN
PROGRAM 3 3 IRE 1 NaN NaN NaN NaN NaN NaN NaN
YEAR 3 NaN NaN NaN 2017.67 0.57735 2017 2017.5 2018 2018 2018
INDUSTRY 3 3 FS - B 1 NaN NaN NaN NaN NaN NaN NaN
Responses 3 NaN NaN NaN 182.333 102.627 64 150 236 241.5 247
# of Questions 3 NaN NaN NaN 44.3333 1.52753 43 43.5 44 45 46
New Zone 3 3 Low 1 NaN NaN NaN NaN NaN NaN NaN
Q1 2 NaN NaN NaN 4.68594 0.101995 4.61382 4.64988 4.68594 4.722 4.75806
Q2 3 NaN NaN NaN 4.56234 0.220111 4.30932 4.48868 4.66803 4.68886 4.70968
Upvotes: 1