Nathan

Reputation: 101

How to re-order data in R and create a new variable for the data?

I have been working with the CDC FluView dataset, retrieved by this code:

    library(cdcfluview)
    library(ggplot2)
    usflu <- get_flu_data("national", "ilinet", years=1998:2015)

What I am trying to do is create a new week variable, call it "week_new", that renumbers the WEEK variable from this dataset so that week 30 of each year becomes the first week. For example, in 1998, instead of week 1 corresponding to the first calendar week of that year, I would like calendar week 30 to be the first week, with every subsequent year following the same scale. I am also trying to create another new variable called "season", which puts each week into its corresponding flu season, say "1998-1999" for week 30 of 1998 through week 29 of 1999, and so on.

I believe this involves a for loop and conditional statements, but I am not familiar with how to use these in R. I am new to programming and am learning Java and R at the same time, and have only worked with loops in Java so far.

Here is what I have tried so far. I think it's supposed to be something like this:

    wk_num <- 1
    for(i in nrow(usflu)){
      if(week == 31){
        wk_num <- 1
        wk_new[i] <- wk_num
        wk_num <- wk_num+1
        }
      if(week < 53){
        season[i] <- paste(Yr[i], '-', Yr[i] +1)
      }
      else{
      }

Any help is greatly appreciated and hopefully what I am asking makes sense. I am hoping to understand re-ordering for the future as I believe it will be an important tool for me to have at my disposal for coding in R.

Upvotes: 3

Views: 98

Answers (1)

JasonAizkalns

Reputation: 20463

Here's one way to accomplish this with the packages dplyr and tidyr:

    library(dplyr)
    library(tidyr)

    usflu_df <- tbl_df(usflu)

    usflu_df %>%
      complete(YEAR, WEEK) %>%
      filter(!(YEAR == 1998 & WEEK < 30)) %>%
      mutate(season = cumsum(WEEK == 30),
             season_nm = paste(1997 + season, 1998 + season, sep = "-")) %>%
      group_by(season) %>%
      mutate(new_wk = seq_along(season)) %>%
      select(YEAR, WEEK, new_wk, season, season_nm)

    #     YEAR  WEEK new_wk season season_nm
    #    (int) (int)  (int)  (int)     (chr)
    # 1   1998    30      1      1 1998-1999
    # 2   1998    31      2      1 1998-1999
    # 3   1998    32      3      1 1998-1999
    # 4   1998    33      4      1 1998-1999
    # 5   1998    34      5      1 1998-1999
    # 6   1998    35      6      1 1998-1999
    # 7   1998    36      7      1 1998-1999
    # 8   1998    37      8      1 1998-1999
    # 9   1998    38      9      1 1998-1999
    # 10  1998    39     10      1 1998-1999

Talking through this...

First, use tidyr::complete to turn implicit missing values into explicit missing values; the original data pulled back did not have all of the weeks for 1998. Next, to make our lives easier, filter out the irrelevant records from 1998, that is, anything before week 30 of 1998. We then create two new variables, season and season_nm, via cumsum and a simple paste. The season counter simply increments every time it sees WEEK == 30, which relies on the rows being in YEAR, WEEK order. This is useful because it does not depend on how many weeks a given year has (some years have a week 53). Finally, we group_by season so that seq_along(season) numbers the rows within each season, which gives the new_wk variable.
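If it helps to see the cumsum and seq_along steps in isolation, here is a minimal base R sketch of the same idea with no loop and no extra packages. The `toy` data frame is made up for illustration (it is not the real ILINet data), and using `ave` for the per-season numbering is my substitution for group_by/mutate, not part of the pipeline above. It assumes the rows are already sorted by YEAR and WEEK.

```r
# Hypothetical toy data: weeks 28-52 of 1998 followed by weeks 1-32 of 1999
toy <- data.frame(
  YEAR = c(rep(1998L, 25), rep(1999L, 32)),
  WEEK = c(28:52, 1:32)
)

# Drop everything before the first season starts (before week 30 of 1998)
toy <- toy[!(toy$YEAR == 1998 & toy$WEEK < 30), ]

# WEEK == 30 is TRUE once per year, so its running total is a season counter
toy$season <- cumsum(toy$WEEK == 30)

# Season 1 becomes "1998-1999", season 2 becomes "1999-2000", and so on
toy$season_nm <- paste(1997 + toy$season, 1998 + toy$season, sep = "-")

# Number the rows 1, 2, 3, ... within each season
toy$new_wk <- ave(toy$WEEK, toy$season, FUN = seq_along)

# Look at the rows around the season boundary in 1999
toy[toy$YEAR == 1999 & toy$WEEK %in% 28:31, ]
```

The comparison `toy$WEEK == 30` is evaluated for the whole column at once, which is why no for loop or if statement is needed. That is the usual way to write this kind of thing in R, and it is also what mutate is doing in the dplyr version.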

Upvotes: 2
