Is there a way to turn this hulk of code into an R function?

Question

I'm not extremely adept with R and recently I've been trying to learn how to write functions well. So I have a section of code that if I write out in a "non-function" way, it will end up being over a thousand lines of code. The problem is that it's really only about six lines of "unique" code but its running on different subsets of a large data set.

df <- subset(data, FileName == "File Name" & Category == "Category Name" & Case == "Case Name")
df <- df %>% group_by(TestNum) %>% summarise(FileName = FileName[1], Version = Version[1], Measure = Measure[1], RepMean = mean(Value), Case = Case[1])
df <- df[c(2, 3, 4, 5, 1, 6)]
df$Sigma1 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + sd(df$RepMean, na.rm=TRUE))|(df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - sd(df$RepMean, na.rm=TRUE))
df$Sigma2 <- (df$RepMean > (mean(df$RepMean,na.rm=TRUE)) + 2 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 2 * (sd(df$RepMean, na.rm=TRUE)))
df$Sigma3 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + 3 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 3 * (sd(df$RepMean, na.rm=TRUE)))

The original data set has 6 unique values in the FileName column, 7 in the Category column, and 4 in the Case column which means that I'm creating 168 unique df data frames with these lines of code that I'm using rbind.fill on to create a single data frame ("StatTable") that I then run this on:

LatestTestNum <- max(data$TestNum, na.rm=TRUE)
ControlTable <- subset(StatTable, (Sigma1 == "TRUE" | Sigma2 == "TRUE" | Sigma3 == "TRUE") & TestNum == LatestTestNum)
ControlTable <- ControlTable[, c("FileName, "Category", "Case", "Sigma1", "Sigma2", "Sigma3")]

Where ControlTable is the end product I'm looking for.

So is this something where a function would greatly reduce my tedious pain? Especially when I want to modify the way this works, it requires changing every single df code by hand.

EDIT: Here's an explanation of what is found in each column of the original data set.

Column1:FileName -- The name of the file that the data comes from
Column2:Version -- The version of the software that data comes from
Column3:Category -- The particular data type measured
Column4:Value -- The value of the data
Column5:TestNum -- TestNum is an integer value associated with the version number. This makes it easier to organize and sort data rather than using the Version column which is a string. (So for example version 1.0 might be TestNum=1 and 1.1 TestNum=2)
Column6:RepNum -- The replication count of that version. (Files are run multiple times per version)
Column7:Case -- There are different ways that the software is "setup" for data collection.

Here's a working data set.

FileName <- c("File1","File1","File1","File1","File2","File2","File2","File2","File1","File1","File1","File1","File2","File2","File2","File2","File1","File1","File1","File1","File2","File2","File2","File2","File1","File1","File1","File1","File2","File2","File2","File2")
Version <- c("1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2")
Category <- c("Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2")
Value <- rpois(n = 32, lambda = 100)
TestNum <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
RepNum <- c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2)
Case <- c("Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2")
df <- data.frame(FileName,Version,Category,Value,TestNum,RepNum,Case)

2nd EDIT: I've provided an answer myself because I've worked out a function that will do the basic steps that lead to the various "df"'s with unique "FileName", "Category", and "Case". I still would like to be able to fold in the need to rbind 168 different data frames but the primary thing I'd also like to be able to add into this function is the ability to filter out certain TestNum's easily.

So for example, one of my unique data frames works best with this subset:

df <- subset(data, FileName == "File1" & Category == "Category1" & Case == "Case1" &
TestNum > 11)

but another data frame might work best with this subset:

df <- subset(data, FileName == "File1" & Category == "Category1" & Case == "Case1" &
(TestNum > 8 & TestNum != 21 & TestNum != 32))

I figure I should be able to add "TestNum" as another argument to my function but I'm not sure how I could allow for this much control on how much I can filter.

Fun extra challenge:

Because I'm working with a large existing data set, I've tweaked each sigma value so that certain data points are filtered out of the mean and standard deviation calculations (this is necessary to actually be any use in detecting new data points that fall outside of these sigma values -- the entire purpose of this code). Is there a way to write a function that also allows for this same tweaking?

Aaron - mostly inactive · Accepted Answer

For the 168 unique FileName/Category/Case combinations, it would seem entirely natural to use dplyr the way you do within your function. First group by FileName/Category/Case/TestNum and get your RepMeans, then group by FileName/Category/Case and do the calculations to get if it is 1, 2, or 3 SD's away. Rather than your comparison code, here I compute the number of SDs away first and then use that, which feels more natural and also repeats the computations less.

df %>% group_by(FileName, Category, Case, TestNum) %>%
       summarise(RepMean = mean(Value)) %>%
       group_by(FileName, Category, Case) %>%
       mutate(diff.sd = abs((RepMean - mean(RepMean, na.rm=TRUE))/sd(RepMean, na.rm=TRUE)),
              Sigma1 = diff.sd > 1,
              Sigma2 = diff.sd > 2,
              Sigma3 = diff.sd > 3)

For your additional subsetting, that seems most natural to me to think of it as simply removing the rows you don't want, rather than including the one you do. Once you've removed them from the full data set, you can just run this code.

EDIT to demonstrate output: Here I show output with sigma values of 1+, 2+, and 3+, by adding outliers to your original data, and also adapting it to have more data points in each group and only one in each TestNum. In this version, I also output the group mean, sd, and size to be sure it's all working correctly.

df$Case <- "Case1"
df$Category <- "Category1"
df$TestNum <- 1:nrow(df)
df$Value[1] <- 5000
df$Value[5] <- 140
out <- df %>% group_by(FileName, Category, Case, TestNum) %>%
       summarise(RepMean = mean(Value)) %>%
       group_by(FileName, Category, Case) %>%
           mutate(group.mean=mean(RepMean, na.rm=TRUE),
                  group.sd=sd(RepMean, na.rm=TRUE),
                  group.n=length(RepMean),
                  diff.sd = abs((RepMean - mean(RepMean, na.rm=TRUE))/sd(RepMean, na.rm=TRUE)),
              Sigma1 = diff.sd > 1,
              Sigma2 = diff.sd > 2,
              Sigma3 = diff.sd > 3)
head(out[order(-out$diff.sd),])
## Source: local data frame [6 x 12]
## Groups: FileName, Category, Case [2]
##  
##   FileName  Category  Case TestNum RepMean group.mean   group.sd group.n   diff.sd Sigma1 Sigma2 Sigma3
##                                           
## 1    File1 Category1 Case1       1    5000   402.4375 1226.06857      16 3.7498413   TRUE   TRUE   TRUE
## 2    File2 Category1 Case1       5     140   103.5625   13.29646      16 2.7403912   TRUE   TRUE  FALSE
## 3    File2 Category1 Case1      13      85   103.5625   13.29646      16 1.3960483   TRUE  FALSE  FALSE
## 4    File2 Category1 Case1      15     118   103.5625   13.29646      16 1.0858154   TRUE  FALSE  FALSE
## 5    File2 Category1 Case1      16      90   103.5625   13.29646      16 1.0200084   TRUE  FALSE  FALSE
## 6    File2 Category1 Case1      14      91   103.5625   13.29646      16 0.9448004  FALSE  FALSE  FALSE

Is there a way to turn this hulk of code into an R function?

Fun extra challenge:

Answers (2)

Related Questions