Reputation: 720
My code is supposed to return statistical analysis of approx 65 columns of data (questions from a survey). Sample data is given below, as well as the current code. Currently, the output only shows the columns that have no strings included (for the others, they return as NaN and don't even show up in the Excel).
I believe the issue is resulting from some of the data points being marked with 'No Data' and some marked with 'Outlier'
I'd like to learn a way to ignore the outlier/no data points and display statistics such as mean or median for the rest of the data. I'd also love to learn how to incorporate conditional functions to display results such as 'count of responses > 4.25' so that I can expand on the analysis.
Q1 Q2 Q3 Q4 Q5 Q6
4.758064516 4.709677419 4.629032258 Outlier 4.708994709 4.209677419
4.613821138 No Data 4.259259259 4.585774059 4.255927476 Outlier
4.136170213 4.309322034 4.272727273 4.297169811 No Data 4.29468599
4.481558803 4.581476323 4.359495445 4.558252427 4.767926491 3.829030007
4.468085106 4.446808511 4.425531915 4.446808511 4.423404255 4.14893617
Sample Desired Output (doesnt correlate to sample data):
Code:
import pandas as pd
from pandas import ExcelWriter
# Pull in Data
path = r"C:\Users\xx.xx\desktop\Python\PyTest\Pyxx.xlsx"
sheet = 'Adjusted Data'
data = pd.read_excel(path,sheet_name=sheet)
#Data Analysis
analysis = pd.DataFrame(data.agg(['count','min','mean', 'median', 'std']), columns=data.columns).transpose()
print(analysis)
g1 = data.groupby('INDUSTRY').median()
print(g1)
g2 = data.groupby('New Zone').median()
print(g2)
#Excel
path2 = r"C:\Users\xx.xx\desktop\Python\PyTest\Pyxx2.xlsx"
writer = ExcelWriter(path2)
g1.to_excel(writer,'x')
g2.to_excel(writer,'y')
analysis.to_excel(writer,'a')
data.to_excel(writer,'Adjusted Data')
writer.save()
EDIT Count how many of the responses to Q1 are > X (in this case, K1 = COUNTIF(K1:K999,TRUE))
I want this the values found in K1 & M1 (and so on for all of the questions) to be added to the analysis table like below:
Upvotes: 1
Views: 209
Reputation: 2253
This happens exactly because of the Strings. Thay cannot be summed with double numbers. It is an undefined operation hence the Nan.
Try and cleanup the data.
Options are:
Anyway I would try to remove the strings from the data.
If you cannot do that, it probably means that this data doesn't belong together with the rest because it comes from a different distribution.
Upvotes: 3