RCarmody

Reputation: 720

Create DataFrame with Multiple Lists

I have a data frame (data) with multiple data types (outliers were previously removed and marked with the string Outlier). I am looking to summarize this data into a new data frame (analysis), but I'm running into issues with the data types.

The issue I am facing is that some of the columns are descriptors (categories, names, countries, etc.) and are not included in the numerical lists (mean, med, sd). This creates a mismatch in the number of rows in the lists (len(title) = 64, len(mean) = 61).

I'd like the data frame to match up to all 64 columns, with the descriptors marked as NaN for numerical fields such as mean (because I know you can't take the mean of ['Blue', 'Red', 'Yellow']).

Sample Data:

ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|4.13617021276596|4.30932203389831


Code:

title = list(data.keys())     # all 64 column names
n = list(data.count())        # counts cover every column
mean = list(data.mean())      # numeric columns only -> 61 entries
med = list(data.median())     # numeric columns only
sd = list(data.std())         # numeric columns only

analysis = pd.DataFrame({'Mean':mean,'Median':med,'SD':sd})
print(analysis)
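To see why the lists end up with different lengths, the sample rows above reproduce the mismatch in miniature (a sketch; `numeric_only=True` is passed explicitly because newer pandas versions no longer silently skip non-numeric columns):

```python
import io
import pandas as pd

raw = """ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|4.13617021276596|4.30932203389831"""
data = pd.read_csv(io.StringIO(raw), sep='|')

title = list(data.keys())                  # all 9 column names
mean = list(data.mean(numeric_only=True))  # only the 5 numeric columns
print(len(title), len(mean))               # 9 vs 5 -- same shape of mismatch as 64 vs 61
```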

Current Output:


Desired Output: additional rows should be shown with the NaN value where no numerical values exist (i.e. for a category or country column). This would increase the number of rows to 64 rather than 61 and allow the additional columns (Title, Count, etc.) to be added.
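One way to get that alignment (a sketch, shown with the three sample rows; the full frame would give 64 rows) is to keep the pandas Series returned by mean/median/std instead of converting them to plain lists. The DataFrame constructor then aligns them on their index, and reindexing on data.columns restores the descriptor columns as NaN rows:

```python
import io
import pandas as pd

raw = """ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|4.13617021276596|4.30932203389831"""
data = pd.read_csv(io.StringIO(raw), sep='|')

# Series (not lists) keep the column names as their index, so the
# constructor can line them up; reindex brings back every column.
analysis = pd.DataFrame({
    'Count': data.count(),
    'Mean': data.mean(numeric_only=True),
    'Median': data.median(numeric_only=True),
    'SD': data.std(numeric_only=True),
}).reindex(data.columns)
print(analysis)
```

Descriptor columns such as ORG keep their Count but show NaN for Mean, Median and SD.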


Upvotes: 0

Views: 1806

Answers (2)

Jondiedoop

Reputation: 3353

If I understand correctly, you can combine agg with the DataFrame constructor, where you specify which columns you want to keep. Finally, transpose to switch rows and columns:

pd.DataFrame(df.agg(['mean', 'median', 'std']), columns=df.columns).transpose()
# Output
#                    mean       median         std
#ORG                     NaN          NaN         NaN
#PROGRAM                 NaN          NaN         NaN
#YEAR            2017.666667  2018.000000    0.577350
#INDUSTRY                NaN          NaN         NaN
#Responses        182.333333   236.000000  102.627157
## of Questions    44.333333    44.000000    1.527525
#New Zone                NaN          NaN         NaN
#Q1                 4.502685     4.613821    0.325502
#Q2                 4.562344     4.668033    0.220111

EDIT: You can also write a custom function with which to aggregate the columns, e.g.:

def more_than_4(col):
    '''Count nr of entries larger than 4'''
    return (col > 4).sum()

df.agg(['mean', more_than_4])

#Output
#                    YEAR   Responses  ...        Q1        Q2
#mean         2017.666667  182.333333  ...  4.502685  4.562344
#more_than_4     3.000000    3.000000  ...  3.000000  3.000000

Upvotes: 1

WGS

Reputation: 14179

This is actually quite simple.

import pandas as pd
df = pd.read_csv('database.txt', sep='|') # Read data.
print(df.describe(include='all').T) # Just force .describe() to show all columns then transpose.

Should output:

               count unique       top freq     mean       std      min      25%      50%      75%      max
ORG                3      3        UK    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
PROGRAM            3      3       IRE    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
YEAR               3    NaN       NaN  NaN  2017.67   0.57735     2017   2017.5     2018     2018     2018
INDUSTRY           3      3  PRD - LF    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
Responses          3    NaN       NaN  NaN  182.333   102.627       64      150      236    241.5      247
# of Questions     3    NaN       NaN  NaN  44.3333   1.52753       43     43.5       44       45       46
New Zone           3      3       Low    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
Q1                 3    NaN       NaN  NaN  4.50269  0.325502  4.13617    4.375  4.61382  4.68594  4.75806
Q2                 3    NaN       NaN  NaN  4.56234  0.220111  4.30932  4.48868  4.66803  4.68886  4.70968

The median is the 50% column.
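If only the original four statistics are wanted, the relevant columns can be pulled out of the describe output and '50%' renamed (a sketch, assuming the same pipe-separated sample data):

```python
import io
import pandas as pd

raw = """ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|4.13617021276596|4.30932203389831"""
df = pd.read_csv(io.StringIO(raw), sep='|')

# Select just count/mean/median/std and give '50%' a friendlier name.
summary = (df.describe(include='all').T
             .loc[:, ['count', 'mean', '50%', 'std']]
             .rename(columns={'50%': 'Median'}))
print(summary)
```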

EDIT:

If there's a risk of string values being inserted into numeric columns and messing up the values/dtypes, just coerce those columns to numeric.

import pandas as pd
df = pd.read_csv('database.txt', sep='|')
main_cols = ['Q1', 'Q2']
df[main_cols] = df[main_cols].apply(pd.to_numeric, errors='coerce')
print(df.describe(include='all').T)

Tested with the following data (note the BLAHHHHHHH value in the last row):

ORG|PROGRAM|YEAR|INDUSTRY|Responses|# of Questions|New Zone|Q1|Q2
USA|MO|2018|PRD - LF|64|44|High|4.75806451612903|4.70967741935484
CAN|ALB|2017|FS - B|247|43|Medium|4.61382113821138|4.66803278688525
UK|IRE|2018|RES - U|236|46|Low|BLAHHHHHHH|4.30932203389831

Output is:

               count unique     top freq     mean       std      min      25%      50%      75%      max
ORG                3      3     CAN    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
PROGRAM            3      3     IRE    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
YEAR               3    NaN     NaN  NaN  2017.67   0.57735     2017   2017.5     2018     2018     2018
INDUSTRY           3      3  FS - B    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
Responses          3    NaN     NaN  NaN  182.333   102.627       64      150      236    241.5      247
# of Questions     3    NaN     NaN  NaN  44.3333   1.52753       43     43.5       44       45       46
New Zone           3      3     Low    1      NaN       NaN      NaN      NaN      NaN      NaN      NaN
Q1                 2    NaN     NaN  NaN  4.68594  0.101995  4.61382  4.64988  4.68594    4.722  4.75806
Q2                 3    NaN     NaN  NaN  4.56234  0.220111  4.30932  4.48868  4.66803  4.68886  4.70968

Upvotes: 1
