I have been working with the CDC FluView dataset, retrieved by this code:
library(cdcfluview)
library(ggplot2)
usflu <- get_flu_data("national", "ilinet", years=1998:2015)
What I am trying to do is create a new week variable, call it "week_new", that reorders the WEEK variable from this dataset so that week 30 of each year counts as the first week. For example, in 1998, instead of week 1 corresponding to the first week of that year, I would like week 30 to be week 1, with every subsequent year on the same scale. I am also trying to create another new variable called "season", which simply puts each week into its corresponding flu season, say "1998-1999" for week 30 of 1998 through week 29 of 1999, and so on.
I believe this involves a for loop and conditional statements, but I am not familiar with how to use these in R. I am new to programming and am learning Java and R at the same time, and have only worked with loops in Java so far.
Here is what I have tried so far. I think it's supposed to be something like this:
wk_num <- 1
for(i in nrow(usflu)){
if(week == 31){
wk_num <- 1
wk_new[i] <- wk_num
wk_num <- wk_num+1
}
if(week < 53){
season[i] <- paste(Yr[i], '-', Yr[i] +1)
}
else{
}
Any help is greatly appreciated and hopefully what I am asking makes sense. I am hoping to understand re-ordering for the future as I believe it will be an important tool for me to have at my disposal for coding in R.
Reputation: 20463
Here's one way to accomplish this with the packages dplyr and tidyr:
library(dplyr)
library(tidyr)
usflu_df <- tbl_df(usflu)
usflu_df %>%
complete(YEAR, WEEK) %>%
filter(!(YEAR == 1998 & WEEK < 30)) %>%
mutate(season = cumsum(WEEK == 30),
season_nm = paste(1997 + season, 1998 + season, sep = "-")) %>%
group_by(season) %>%
mutate(new_wk = seq_along(season)) %>%
select(YEAR, WEEK, new_wk, season, season_nm)
# YEAR WEEK new_wk season season_nm
# (int) (int) (int) (int) (chr)
# 1 1998 30 1 1 1998-1999
# 2 1998 31 2 1 1998-1999
# 3 1998 32 3 1 1998-1999
# 4 1998 33 4 1 1998-1999
# 5 1998 34 5 1 1998-1999
# 6 1998 35 6 1 1998-1999
# 7 1998 36 7 1 1998-1999
# 8 1998 37 8 1 1998-1999
# 9 1998 38 9 1 1998-1999
# 10 1998 39 10 1 1998-1999
Talking through this...
First, use tidyr::complete to turn implicit missing values into explicit missing values, since the original data pulled back did not have all of the weeks for 1998. Next, filter out the irrelevant records from 1998, that is, anything before week 30 of 1998, to make our lives easier. We then create two new variables, season and season_nm, via cumsum and a simple paste call. The season counter simply increments every time it sees WEEK == 30, which is useful because of leap years. We then group_by season so that we can seq_along season to create the new_wk variable.
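For comparison with the loop in the question, here is a minimal base-R sketch of the same idea that runs without downloading anything. It is only an illustration under stated assumptions: the toy data frame flu stands in for the YEAR and WEEK columns of usflu, every year is assumed to have exactly 52 weeks, and the names flu, start_year, and wk_loop are made up for this example. It also shows a working version of the loop; note that for (i in nrow(x)) visits only the last row, while for (i in seq_len(nrow(x))) visits every row.

```r
# Toy stand-in for the YEAR/WEEK columns of usflu
# (assumption: every year has exactly 52 weeks).
flu <- expand.grid(WEEK = 1:52, YEAR = 1998:2000)
flu <- flu[order(flu$YEAR, flu$WEEK), c("YEAR", "WEEK")]

# Drop the weeks before the first season start (week 30 of the first year).
flu <- flu[!(flu$YEAR == min(flu$YEAR) & flu$WEEK < 30), ]
rownames(flu) <- NULL

# Weeks 30 and later belong to the season that starts in that year;
# earlier weeks belong to the season that started the year before.
start_year <- ifelse(flu$WEEK >= 30, flu$YEAR, flu$YEAR - 1)
flu$season <- paste(start_year, start_year + 1, sep = "-")

# Vectorised: number the rows 1, 2, 3, ... within each season.
flu$week_new <- ave(seq_len(nrow(flu)), flu$season, FUN = seq_along)

# Loop version, closer to the attempt in the question.
wk_loop <- integer(nrow(flu))
wk_num <- 0L
for (i in seq_len(nrow(flu))) {
  if (flu$WEEK[i] == 30) wk_num <- 0L  # reset the counter at each season start
  wk_num <- wk_num + 1L
  wk_loop[i] <- wk_num
}

head(flu)
```

On this toy data the two approaches give the same week numbers. On the real dataset, rows would need to be sorted by YEAR and WEEK first, and years with a week 53 would simply add one more week to that season.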