Reputation: 19
I have a dataset with 15 columns col1 to col15 being numeric. I have 100 rows of data with names attached to each row as a factor. I want to do a summary for each row for all 15 columns.
head(df2phcl[,c(1:16)])
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13 col14 col15 NAME
78 95 101 100 84 93 93 85 81 97 80 94 81 79 87 R04-001
100 61 96 75 98 92 99 99 102 83 84 NA 101 93 96 R04-002
81 84 82 83 77 86 90 92 92 78 86 91 59 80 84 R04-003
91 84 87 95 103 93 92 95 86 92 107 96 94 87 97 R04-004
72 79 66 98 84 75 85 83 75 80 91 65 90 81 73 R04-005
72 75 68 44 79 64 83 71 81 82 85 63 87 94 60 R04-006
My code for this is.
library(dplyr)
####Rachis
SUMCL <- df2phcl %>%
group_by(name) %>%
summarise(CL = mean(c(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15), na.rm=T),
CLMAX = max(c(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15), na.rm=T),
CLMIN = min(c(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15), na.rm=T),
CLSTD = sd(c(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15), na.rm=T),
OUT = outliers(c(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15), na.rm=T))
head(SUMCL)
tail(SUMCL)
My resulting analysis comes out as...
Error:
Evaluation error: missing value where TRUE/FALSE needed.
I've also tried this...
df2phcl$col1+col2+col3+col4+col5+col6+col7+col8+col9+col10+col11+col12+col13+co114+col15[!df2phcl$col1+col2+col3+col4+col5+col6+col7+col8+col9+col10+col11+col12+col13+col14+col15%in%boxplot.stats(df2phcl$col1+col2+col3+col4+col5+col6+col7+col8+col9+col10+co111+col12+col13+col14+col15)$out]
This returns ....
Error: object 'col2' not found
Not sure what I'm doing wrong this works with mean, max, min, and sd.
> head(SUMCL)
# A tibble: 6 x 11
# Groups: ENTRY, NAME, HEADCODE, RHTGENES, HEAD, PL [6]
ENTRY NAME HEADCODE RHTGENES HEAD PL PH CL CLMAX CLMIN CLSTD
<int> <fctr> <fctr> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 R04-001 CAW Rht1 Club 319 83 88.53333 101 78 7.989875
2 2 R04-002 LBW Wildtype Common 330 102 91.35714 102 61 11.770936
3 3 R04-003 CBW Rht2 Club 230 82 83.00000 92 59 8.220184
4 4 R04-004 LBW Rht1 Common 328 117 93.26667 107 84 6.192930
5 5 R04-005 CBW Rht1 Club 280 97 79.80000 98 65 9.182281
6 6 R04-006 LAW Rht1 Common 310 92 73.86667 94 44 12.749603
I'm just wanting to filter the outliers at 3 sd or more and then use the dplyr to package to do my statistics...
Upvotes: 0
Views: 437
Reputation: 19
I managed to get some of the col std. dev. changed; however, I'm not sure how many observations it took out. I was wanting to take out from the top and the bottom of the distribution at an even amount. Like a trimmed mean, it would take out 20% of the obs. from the top and bottom of the distribution. What I was curious about was just leaving the observations from the top and bottom (+-3 SD) of the distribution.
> SUMCL <- df2phcl %>%
+ gather(column, value, -c(ENTRY, NAME, HEADCODE, RHTGENES, HEAD,PL,PH)) %>% # reshape from wide to long
+ group_by(ENTRY, NAME, HEADCODE, RHTGENES, HEAD,PL,PH) %>% # summarize by name
+ mutate(value = replace(value, abs(value - mean(value)) > 2*sd(value), NA)) %>% # set outliers to NA
+ summarise(CL = mean(value, na.rm=TRUE),
+ CLMAX = max(value, na.rm=TRUE),
+ CLMIN = min(value, na.rm=TRUE),
+ N = sum(!is.na(value), na.rm=TRUE),
+ CLSTD= sd(value, na.rm=TRUE),
+ CLSE = (CLSTD / sqrt(N)))
> head(SUMCL)
# A tibble: 6 x 13
# Groups: ENTRY, NAME, HEADCODE, RHTGENES, HEAD, PL [6]
ENTRY NAME HEADCODE RHTGENES HEAD PL PH CL CLMAX CLMIN N CLSTD CLSE
<int> <fctr> <fctr> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 1 R04-001 CAW Rht1 Club 319 83 88.53333 101 78 15 7.989875 2.062977
2 2 R04-002 LBW Wildtype Common 330 102 91.35714 102 61 14 11.770936 3.145915
3 3 R04-003 CBW Rht2 Club 230 82 84.71429 92 77 14 5.029583 1.344213
4 4 R04-004 LBW Rht1 Common 328 117 92.28571 103 84 14 5.075258 1.356420
5 5 R04-005 CBW Rht1 Club 280 97 79.80000 98 65 15 9.182281 2.370855
6 6 R04-006 LAW Rht1 Common 310 92 76.00000 94 60 14 10.076629 2.693093
Upvotes: 1
Reputation: 93851
I'm not exactly sure what you're trying to do, so let me know if the code below is on the right track.
The approach below is to convert the data from wide to long format, which makes it much easier to do the summaries for each level of name
.
library(tidyverse)
# Fake data
set.seed(2)
dat = as.data.frame(replicate(15, rnorm(100)))
names(dat) = paste0("col", 1:15)
dat$name = paste0(rep(LETTERS[1:10], each=10), rep(letters[1:10], 10))
# Convert data to long format, remove outliers and summarize
dat %>%
gather(column, value, -name) %>% # reshape from wide to long
group_by(name) %>% # summarize by name
mutate(value = replace(value, abs(value - mean(value)) > 2*sd(value), NA)) %>% # set outliers to NA
summarise(mean = mean(value, na.rm=TRUE),
max = max(value, na.rm=TRUE),
sd = sd(value, na.rm=TRUE))
name mean max sd 1 Aa 0.007848188 1.238744 0.8510016 2 Ab -0.208536464 1.980401 1.2764606 3 Ac -0.152986713 1.587845 0.8443106 4 Ad -0.413543054 0.965692 0.7225872 5 Ae -0.112648322 1.178716 0.7269527 6 Af 0.442268890 2.048040 1.0350119 7 Ag 0.390627994 1.978260 0.8716681 8 Ah 0.080505879 2.396349 1.3128403 9 Ai 0.257925059 1.984474 1.0196722 10 Aj 0.137469703 1.470177 0.7192616 # ... with 90 more rows
Upvotes: 2