Reputation: 1592
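I have a dataframe similar to this:
import pandas as pd
from scipy import stats
# codes has one more element than values; zip() stops at the shorter list (15 rows)
codes=[1,3,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
values=[702,713,701,721,705,715,703,712,706,710,702,715,698,718,704]
df = pd.DataFrame(list(zip(codes, values)),
columns =['code', 'val'])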
>>>
code val
0 1 702
1 3 713
2 1 701
3 3 721
4 1 705
5 3 715
6 1 703
7 3 712
8 1 706
9 3 710
10 1 702
11 3 715
12 1 698
13 3 718
14 1 704
I want to check whether there is a significant difference between the values of group 1 and group 3. As a first step I used scipy's Shapiro-Wilk test to check whether the data is normally distributed.
I believe I made a mistake in my original code:
shapiro1=stats.shapiro(df[df['code'] == 1])
>>>
ShapiroResult(statistic=0.6468859314918518, pvalue=4.644487489713356e-05)
shapiro3=stats.shapiro(df[df['code'] == 3])
>>>
ShapiroResult(statistic=0.6508359909057617, pvalue=0.00011963312863372266)
As you can see, I filtered the dataframe by code but did not select the values, so I passed a dataframe with two columns (a single code value plus the values) to the test.
Then I did what I believe is the fix:
stats.shapiro(df[df['code'] == 3]['val'])
>>>
ShapiroResult(statistic=0.967737078666687, pvalue=0.8816877007484436)
So with p ≈ 0.88 the test does not reject normality for these values.
When I print what I originally passed to shapiro:
df[df['code'] == 3]
I get a dataframe with two columns. What does the test check in that case? The distribution of "code"? Some mix of both columns?
My question:
What does the Shapiro test check when I pass it a two-column dataframe?
Edit: I was also able to add more columns and run the Shapiro test on them (just with random numbers).
Upvotes: 1
Views: 519
Reputation: 2252
From the source on GitHub, the first thing that happens on calling stats.shapiro() is that the input is passed to numpy.ravel(). This returns a view (if possible) or copy of your data as a flattened, contiguous, 1-D array.
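Basically, it puts all the columns into one big, long bucket and proceeds to calculate the Shapiro-Wilk test on that. In your case the code values (1 or 3) and the val values (around 700) end up in the same sample, which is clearly not normal, hence the very small p-values.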
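A minimal sketch to check this yourself, using the data from the question. It assumes that passing the filtered frame as a NumPy array (via to_numpy(), my choice for the sketch) behaves like passing the dataframe itself, and the names sub, flat, mixed, flattened and val_only are only illustrative:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Same data as in the question: 15 rows, codes alternating 1 and 3
codes = [1, 3] * 7 + [1]
values = [702, 713, 701, 721, 705, 715, 703, 712, 706, 710, 702, 715, 698, 718, 704]
df = pd.DataFrame({"code": codes, "val": values})

sub = df[df["code"] == 3]              # two columns: code and val

# Flatten row by row: 3, 713, 3, 721, ... (codes and values interleaved)
flat = np.ravel(sub.to_numpy())

mixed = stats.shapiro(sub.to_numpy())  # 2-D input, as in the original mistake
flattened = stats.shapiro(flat)        # explicit 1-D version of the same data
val_only = stats.shapiro(sub["val"].to_numpy())  # what was actually intended

print(flat)
print(mixed)
print(flattened)
print(val_only)
```

If the first two results agree, the two-column call is testing the interleaved codes and values as one sample, while the last call tests only the val column.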
Upvotes: 1