Reputation: 199
I'm not extremely adept with R and recently I've been trying to learn how to write functions well. So I have a section of code that if I write out in a "non-function" way, it will end up being over a thousand lines of code. The problem is that it's really only about six lines of "unique" code but its running on different subsets of a large data set.
df <- subset(data, FileName == "File Name" & Category == "Category Name" & Case == "Case Name")
df <- df %>% group_by(TestNum) %>% summarise(FileName = FileName[1], Version = Version[1], Measure = Measure[1], RepMean = mean(Value), Case = Case[1])
df <- df[c(2, 3, 4, 5, 1, 6)]
df$Sigma1 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + sd(df$RepMean, na.rm=TRUE))|(df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - sd(df$RepMean, na.rm=TRUE))
df$Sigma2 <- (df$RepMean > (mean(df$RepMean,na.rm=TRUE)) + 2 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 2 * (sd(df$RepMean, na.rm=TRUE)))
df$Sigma3 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + 3 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 3 * (sd(df$RepMean, na.rm=TRUE)))
The original data set has 6 unique values in the FileName
column, 7 in the Category
column, and 4 in the Case
column which means that I'm creating 168 unique df
data frames with these lines of code that I'm using rbind.fill
on to create a single data frame ("StatTable") that I then run this on:
LatestTestNum <- max(data$TestNum, na.rm=TRUE)
ControlTable <- subset(StatTable, (Sigma1 == "TRUE" | Sigma2 == "TRUE" | Sigma3 == "TRUE") & TestNum == LatestTestNum)
ControlTable <- ControlTable[, c("FileName, "Category", "Case", "Sigma1", "Sigma2", "Sigma3")]
Where ControlTable
is the end product I'm looking for.
So is this something where a function would greatly reduce my tedious pain? Especially when I want to modify the way this works, it requires changing every single df code by hand.
EDIT: Here's an explanation of what is found in each column of the original data set.
Column1:FileName -- The name of the file that the data comes from
Column2:Version -- The version of the software that data comes from
Column3:Category -- The particular data type measured
Column4:Value -- The value of the data
Column5:TestNum -- TestNum is an integer value associated with the version number. This makes it easier to organize and sort data rather than using the Version column which is a string. (So for example version 1.0 might be TestNum=1 and 1.1 TestNum=2)
Column6:RepNum -- The replication count of that version. (Files are run multiple times per version)
Column7:Case -- There are different ways that the software is "setup" for data collection.
Here's a working data set.
FileName <- c("File1","File1","File1","File1","File2","File2","File2","File2","File1","File1","File1","File1","File2","File2","File2","File2","File1","File1","File1","File1","File2","File2","File2","File2","File1","File1","File1","File1","File2","File2","File2","File2")
Version <- c("1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2")
Category <- c("Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2")
Value <- rpois(n = 32, lambda = 100)
TestNum <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
RepNum <- c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2)
Case <- c("Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2")
df <- data.frame(FileName,Version,Category,Value,TestNum,RepNum,Case)
2nd EDIT: I've provided an answer myself because I've worked out a function that will do the basic steps that lead to the various "df"'s with unique "FileName", "Category", and "Case". I still would like to be able to fold in the need to rbind 168 different data frames but the primary thing I'd also like to be able to add into this function is the ability to filter out certain TestNum
's easily.
So for example, one of my unique data frames works best with this subset:
df <- subset(data, FileName == "File1" & Category == "Category1" & Case == "Case1" &
TestNum > 11)
but another data frame might work best with this subset:
df <- subset(data, FileName == "File1" & Category == "Category1" & Case == "Case1" &
(TestNum > 8 & TestNum != 21 & TestNum != 32))
I figure I should be able to add "TestNum" as another argument to my function but I'm not sure how I could allow for this much control on how much I can filter.
Because I'm working with a large existing data set, I've tweaked each sigma value so that certain data points are filtered out of the mean and standard deviation calculations (this is necessary to actually be any use in detecting new data points that fall outside of these sigma values -- the entire purpose of this code). Is there a way to write a function that also allows for this same tweaking?
Upvotes: 1
Views: 99
Reputation: 37824
For the 168 unique FileName/Category/Case combinations, it would seem entirely natural to use dplyr
the way you do within your function. First group by FileName/Category/Case/TestNum and get your RepMeans, then group by FileName/Category/Case and do the calculations to get if it is 1, 2, or 3 SD's away. Rather than your comparison code, here I compute the number of SDs away first and then use that, which feels more natural and also repeats the computations less.
df %>% group_by(FileName, Category, Case, TestNum) %>%
summarise(RepMean = mean(Value)) %>%
group_by(FileName, Category, Case) %>%
mutate(diff.sd = abs((RepMean - mean(RepMean, na.rm=TRUE))/sd(RepMean, na.rm=TRUE)),
Sigma1 = diff.sd > 1,
Sigma2 = diff.sd > 2,
Sigma3 = diff.sd > 3)
For your additional subsetting, that seems most natural to me to think of it as simply removing the rows you don't want, rather than including the one you do. Once you've removed them from the full data set, you can just run this code.
EDIT to demonstrate output: Here I show output with sigma values of 1+, 2+, and 3+, by adding outliers to your original data, and also adapting it to have more data points in each group and only one in each TestNum. In this version, I also output the group mean, sd, and size to be sure it's all working correctly.
df$Case <- "Case1"
df$Category <- "Category1"
df$TestNum <- 1:nrow(df)
df$Value[1] <- 5000
df$Value[5] <- 140
out <- df %>% group_by(FileName, Category, Case, TestNum) %>%
summarise(RepMean = mean(Value)) %>%
group_by(FileName, Category, Case) %>%
mutate(group.mean=mean(RepMean, na.rm=TRUE),
group.sd=sd(RepMean, na.rm=TRUE),
group.n=length(RepMean),
diff.sd = abs((RepMean - mean(RepMean, na.rm=TRUE))/sd(RepMean, na.rm=TRUE)),
Sigma1 = diff.sd > 1,
Sigma2 = diff.sd > 2,
Sigma3 = diff.sd > 3)
head(out[order(-out$diff.sd),])
## Source: local data frame [6 x 12]
## Groups: FileName, Category, Case [2]
##
## FileName Category Case TestNum RepMean group.mean group.sd group.n diff.sd Sigma1 Sigma2 Sigma3
## <fctr> <chr> <chr> <int> <dbl> <dbl> <dbl> <int> <dbl> <lgl> <lgl> <lgl>
## 1 File1 Category1 Case1 1 5000 402.4375 1226.06857 16 3.7498413 TRUE TRUE TRUE
## 2 File2 Category1 Case1 5 140 103.5625 13.29646 16 2.7403912 TRUE TRUE FALSE
## 3 File2 Category1 Case1 13 85 103.5625 13.29646 16 1.3960483 TRUE FALSE FALSE
## 4 File2 Category1 Case1 15 118 103.5625 13.29646 16 1.0858154 TRUE FALSE FALSE
## 5 File2 Category1 Case1 16 90 103.5625 13.29646 16 1.0200084 TRUE FALSE FALSE
## 6 File2 Category1 Case1 14 91 103.5625 13.29646 16 0.9448004 FALSE FALSE FALSE
Upvotes: 1
Reputation: 199
Here's a simple function that I've worked out:
myFunction <- function(df,FileNameStr,CategoryStr,CaseStr){
df <- subset(df, FileName == FileNameStr & Category == CategoryStr & Case == CaseStr)
df <- df %>% group_by(TestNum) %>% summarise(FileName = FileName[1], Version = Version[1], Category = Category[1], RepMean = mean(Value), Case = Case[1])
df <- df[c(2, 3, 4, 5, 1, 6)]
df$Sigma1 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + sd(df$RepMean, na.rm=TRUE))|(df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - sd(df$RepMean, na.rm=TRUE))
df$Sigma2 <- (df$RepMean > (mean(df$RepMean,na.rm=TRUE)) + 2 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 2 * (sd(df$RepMean, na.rm=TRUE)))
df$Sigma3 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + 3 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 3 * (sd(df$RepMean, na.rm=TRUE)))
return(df)
}
Optimally, I'd love to somehow work in the rbind.fill
bit into this. I feel like there must be a quicker way to do that than manually type out 168 data frame names into the function. Nonetheless, this is the function for creating each of those data frames.
EDIT: I've made an edit (under "2nd EDIT") to my original question that explains more about my need to be able to add filtering capabilities to this function.
Upvotes: 0