Reputation: 57
I am new to R and trying to prepare for an exam in R which will take place in one week.
On one of the homework questions, I am trying to solve a single problem in as many ways as possible (having more tools always comes in handy in a time-constrained coding exam).
The problem is the following. In my dataset, "ckm_nodes.csv":
The variable adoption_date records the month in which the doctor began prescribing tetracycline, counting from November 1953. If the doctor did not begin prescribing it by month 17, i.e. February 1955, when the study ended, this is recorded as Inf. If it's not known when or if the doctor adopted tetracycline, their value is NA. Answer the following. (a) How many doctors began prescribing tetracycline in each month of the study? (b) How many never prescribed it during the study? (c) How many are NAs?
I was trying to use the aggregate( ) function to count the number of doctors starting to prescribe in each month. My base code is:
aggregate(nodes$adoption_date, by = nodes["adoption_date"], length),
which works, except that it does not count the NA values.
I wonder if there is a way to make aggregate() count the NA values, so I read the R documentation on the aggregate() function, which says the following:
na.action
a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.
So I googled how to solve this problem and set "na.action = NULL". However, when I run the code, this is what happens:
aggregate(nodes$adoption_date, by = nodes["adoption_date"], length, na.action = NULL)
Error in FUN(X[[i]], ...) : 2 arguments passed to 'length' which requires 1
I also tried reordering the arguments:
aggregate(nodes$adoption_date, length, by = nodes["adoption_date"], na.action = NULL)
Error in FUN(X[[i]], ...) : 2 arguments passed to 'length' which requires 1
But it doesn't work either.
Any idea how to fix this?
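A possible explanation, offered as a sketch rather than a definitive diagnosis: na.action is an argument of the formula method of aggregate(), while the call above uses the data frame method, so na.action = NULL is forwarded through ... to length(), which then receives two arguments. The data frame method also omits rows whose grouping value is NA. One workaround is to give NA an explicit label before grouping. The names key, key_levels and the label "unknown" are assumptions made for illustration.

```r
# Example data from the question.
nodes <- data.frame(
  adoption_date = rep(c(1:17, NA, Inf), times = c(rep(5, 17), 20, 3))
)

# Hypothetical grouping key: the adoption date as text, with NA given
# the explicit label "unknown" so that aggregate() does not drop it.
key <- as.character(nodes$adoption_date)
key[is.na(key)] <- "unknown"

# Fix the level order (months ascending, then Inf, then "unknown")
# so the groups are not sorted alphabetically.
key_levels <- c(as.character(sort(unique(nodes$adoption_date))), "unknown")
key <- factor(key, levels = key_levels)

# Count the rows in each group; length() counts NA entries too.
counts <- aggregate(
  data.frame(n = nodes$adoption_date),
  by = list(adoption_date = key),
  FUN = length
)
counts
```

If only the counts are needed, table(nodes$adoption_date, useNA = "ifany") is a shorter alternative that also reports the NA group.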
***************** tapply()
Additionally, I was wondering if one can use the "tapply" function to solve Q1 on the homework. I tried
count <- function(data){
return(length(data$adoption_date))
}
count_tetra <- tapply(nodes,nodes$adoption_date,count)
Error in tapply(nodes, nodes$adoption_date, count) : arguments must have same length
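The error above probably arises because tapply() expects its first argument to be a vector with the same length as the grouping vector, whereas the length of a data frame is its number of columns. A minimal sketch that passes the column itself is shown here. Because tapply() turns the grouping vector into a factor, NA does not form a group, so the NA count is taken separately. The names x, counts and n_unknown are assumptions made for illustration.

```r
# Example data from the question.
nodes <- data.frame(
  adoption_date = rep(c(1:17, NA, Inf), times = c(rep(5, 17), 20, 3))
)

x <- nodes$adoption_date

# Group the column by its own values and count each group.
# Inf forms its own group; NA values are left out of the grouping.
counts <- tapply(x, x, length)
counts

# Count the NA values separately.
n_unknown <- sum(is.na(x))
n_unknown
```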
************** loops
I am also wondering how I can use a loop to achieve the same goal.
I can start by sorting the vector:
nodes_sorted <- nodes[order(nodes$adoption_date),]
Then, write a for loop, but how...?
The goal is to get a vector count, where each element of count is the number of doctors who began prescribing in the corresponding month.
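A loop version might look like the sketch below. Sorting is not required for it; the loop compares the column against each month in turn. The names months, n_never and n_unknown are assumptions made for illustration, and the 17-month range is taken from the problem statement.

```r
# Example data from the question.
nodes <- data.frame(
  adoption_date = rep(c(1:17, NA, Inf), times = c(rep(5, 17), 20, 3))
)

x <- nodes$adoption_date
months <- 1:17

# One counter per month of the study.
count <- integer(length(months))
for (i in seq_along(months)) {
  # na.rm = TRUE skips the NA comparisons; Inf never equals a month.
  count[i] <- sum(x == months[i], na.rm = TRUE)
}
names(count) <- months
count

# (b) doctors who never prescribed during the study, (c) unknown values.
n_never <- sum(is.infinite(x))
n_unknown <- sum(is.na(x))
```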
Thanks!
Example data:
nodes <- data.frame( adoption_date = rep(c(1:17,NA,Inf), times = c(rep(5,17),20,3)) )
Upvotes: 2
Views: 725
Reputation: 61
Have you looked at data.table? I believe something like this does the trick.
require(data.table)
# convert nodes to data.table
setDT(nodes)
# count occurrences for each value of adoption_date
nodes[, .N, by = adoption_date]
Upvotes: 1