jerH
jerH

Reputation: 1299

Why does R throw an error on iterative calculation

I'm looking at covid-19 data to calculate estimates for the reproductive number R0.

library(ggplot2)  
library(dplyr)    
library(tidyr)    
library(stringr)  
library(TTR)



# Get COVID cases, available from:
url <- "https://static.usafacts.org/public/data/covid-19/covid_confirmed_usafacts.csv"

DoubleCOV <- read.csv(url, stringsAsFactors = FALSE)
names(DoubleCOV)[1] <- "countyFIPS"

DoubleCovid <- pivot_longer(DoubleCOV, cols=starts_with("X"),
                            values_to="cases",
                            names_to=c("X","date_infected"),
                            names_sep="X") %>%
  mutate(infected = as.Date(date_infected, format="%m.%d.%y"),
         countyFIPS = str_pad(as.character(countyFIPS), 5, pad="0"))

#data is by county, summarise for the state of interest
stateData <- DoubleCovid %>% filter(State == "AL") %>% filter(cases != 0) %>%  
  group_by(infected) %>% summarise(sum(cases)) %>% 
  mutate(DaysSince = infected - min(infected))

names(stateData)[2] <- "cumCases"

#3 day moving average to smooth a little
stateData <- stateData %>% mutate(MA = runMean(cumCases,3))

#calculate doubling rate (DR) and then R0 infectious period/doubling rate 
for(j in 4:nrow(stateData)){
  stateData$DR[j] <- log(2)/log(stateData$MA[j]/stateData$MA[j-1])
  stateData$R0[j] <- 14/stateData$DR[j]
}

CDplot <- stateData %>%
  ggplot(mapping = aes(x = as.numeric(DaysSince), y = R0)) +
  geom_line(color = "firebrick")

print(CDplot)

So in the above the state of interest is Alabama, hence filter(State == "AL") and this works.

But if I change the state to "NY" I get

Error in `$<-.data.frame`(`*tmp*`, "DR", value = c(NA, NA, NA, 0.733907206043719 : 
  replacement has 4 rows, data has 39

head(stateData) yields

infected   cumCases DaysSince    MA
  <date>        <int> <drtn>    <dbl>
1 2020-03-02        1 0 days    NA   
2 2020-03-03        2 1 days    NA   
3 2020-03-04       11 2 days     4.67
4 2020-03-05       23 3 days    12   
5 2020-03-06       25 4 days    19.7 
6 2020-03-07       77 5 days    41.7 

The moving average values in rows 3 and 4 (12 and 4.67) would yield a doubling rate of 0.734 which aligns with the value in the error message value = c(NA, NA, NA, 0.733907206043719 but why does it throw an error after that?

Bonus question: I know loops are frowned upon in R...is there a way to get the moving average and R0 calculation without one?

Upvotes: 1

Views: 78

Answers (1)

Edward
Edward

Reputation: 18739

You have to initialise the new variables before you can access them using the j index. Due to recycling, Alabama, which has 28 rows (divisible by 4), does not return an error, only the warnings about uninitialised columns. New York, however, has 39 rows, which is not divisible by 4 so recycling fails and R returns an error. You shouldn't ignore warnings, sometimes you can, but it's not a good idea.

Try this to see what R (you) is trying to do:

stateData[4]

You should get all rows of the 4th column, not the 4th row.

Solution: initialise your DR and R0 columns first.

stateData$DR <- NA
stateData$R0 <- NA

for(j in 4:nrow(stateData)){
  stateData$DR[j] <- log(2)/log(stateData$MA[j]/stateData$MA[j-1])
  stateData$R0[j] <- 14/stateData$DR[j]
}

For the bonus question, you can use lag in the same mutate with MA:

stateData <- stateData %>% mutate(MA = runMean(cumCases,3),
                                  DR = log(2)/log(MA/lag(MA)),
                                  R0 = 14 / DR)
stateData

# A tibble: 28 x 6
   infected   cumCases DaysSince    MA    DR    R0
   <date>        <int> <drtn>    <dbl> <dbl> <dbl>
 1 2020-03-13        5 0 days     NA   NA    NA   
 2 2020-03-14       11 1 days     NA   NA    NA   
 3 2020-03-15       22 2 days     12.7 NA    NA   
 4 2020-03-16       29 3 days     20.7  1.42  9.89
 5 2020-03-17       39 4 days     30    1.86  7.53
 6 2020-03-18       51 5 days     39.7  2.48  5.64
 7 2020-03-19       78 6 days     56    2.01  6.96
 8 2020-03-20      106 7 days     78.3  2.07  6.78
 9 2020-03-21      131 8 days    105    2.37  5.92
10 2020-03-22      167 9 days    135.   2.79  5.03
# ... with 18 more rows

I'm using Alabama's data.

Upvotes: 1

Related Questions