RCarmody
RCarmody

Reputation: 720

Data Analysis with Outliers

My code is supposed to return statistical analysis of approx 65 columns of data (questions from a survey). Sample data is given below, as well as the current code. Currently, the output only shows the columns that have no strings included (for the others, they return as NaN and don't even show up in the Excel).

I believe the issue is resulting from some of the data points being marked with 'No Data' and some marked with 'Outlier'

I'd like to learn a way to ignore the outlier/no data points and display statistics such as mean or median for the rest of the data. I'd also love to learn how to incorporate conditional functions to display results such as 'count of responses > 4.25' so that I can expand on the analysis.

Q1  Q2  Q3  Q4  Q5  Q6
4.758064516 4.709677419 4.629032258 Outlier 4.708994709 4.209677419
4.613821138 No Data 4.259259259 4.585774059 4.255927476 Outlier
4.136170213 4.309322034 4.272727273 4.297169811 No Data 4.29468599
4.481558803 4.581476323 4.359495445 4.558252427 4.767926491 3.829030007
4.468085106 4.446808511 4.425531915 4.446808511 4.423404255 4.14893617

Sample Desired Output (doesnt correlate to sample data): enter image description here

Code:

import pandas as pd 
from pandas import ExcelWriter

# Pull in Data
path = r"C:\Users\xx.xx\desktop\Python\PyTest\Pyxx.xlsx"
sheet = 'Adjusted Data'
data = pd.read_excel(path,sheet_name=sheet)

#Data Analysis
analysis = pd.DataFrame(data.agg(['count','min','mean', 'median', 'std']), columns=data.columns).transpose()
print(analysis)

g1 = data.groupby('INDUSTRY').median()
print(g1)
g2 = data.groupby('New Zone').median()
print(g2)

#Excel
path2 = r"C:\Users\xx.xx\desktop\Python\PyTest\Pyxx2.xlsx"
writer = ExcelWriter(path2)
g1.to_excel(writer,'x')
g2.to_excel(writer,'y')
analysis.to_excel(writer,'a')
data.to_excel(writer,'Adjusted Data')
writer.save()

EDIT Count how many of the responses to Q1 are > X (in this case, K1 = COUNTIF(K1:K999,TRUE))

enter image description here

I want this the values found in K1 & M1 (and so on for all of the questions) to be added to the analysis table like below: enter image description here enter image description here

Upvotes: 1

Views: 209

Answers (1)

Luca Angioloni
Luca Angioloni

Reputation: 2253

This happens exactly because of the Strings. Thay cannot be summed with double numbers. It is an undefined operation hence the Nan.

Try and cleanup the data.

Options are:

  • Drop the rows that contain no data or outliers if this makes sense in your statistic. (You can do that even one column at a time, computing statistics for one column at a time).
  • Substitute those values with the mean of that column (this is one of the standard procedures in statistics).
  • Think of a domain specific way to treat this kind of data.

Anyway I would try to remove the strings from the data.

If you cannot do that, it probably means that this data doesn't belong together with the rest because it comes from a different distribution.

Upvotes: 3

Related Questions