kurdtc
kurdtc

Reputation: 1621

R: Dataframe organisation, structuring and subsetting dataframe

I have the following dataframe in R (only sample data)

data <- data.frame(NAME=c("NAME1", "NAME1", "NAME1","NAME2","NAME2","NAME2"),
                   ID=c(47,47,47,259,259,259),
                   SURVEY_YEAR=c(1960,1961,1965,2007,2010,2014), 
                   REFERENCE_YEAR=c(1959,1960,1963,2004,2009,2011),
                   CUMULATIVE_SUM=c(-6,-10,-23,-9,NA,-40))

In a tabular form it looks like this:

  NAME  ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM
1 NAME1  47        1960           1959             -6
2 NAME1  47        1961           1960            -10
3 NAME1  47        1965           1963            -23
4 NAME2 259        2007           2004             -9
5 NAME2 259        2010           2009             NA
6 NAME2 259        2014           2011            -40

What I am trying to do is to restructure my dataframe so that it should look like this in the end:

   NAME  ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR
1 NAME1  47        1960           1959             -6                      0
2 NAME1  47        1961           1960            -10                     -6
3 NAME1  47        1965           1963            -23                    -10
4 NAME2 259        2007           2004             -9                      0
5 NAME2 259        2010           2009             NA                     NA
6 NAME2 259        2014           2011            -40                     -9

I am trying to achieve this by using the following code:

# loop through elements in data$CUMULATIVE_SUM
for (i in 1:length(data$CUMULATIVE_SUM)) { 
  # take value of upper row, but take NULL if in upper row there is another NAME or end of table
  if (i==1) {
    value=0 # If first row
  } else { 
    if (data$NAME[i-1]==data$NAME[i]) {
      value=data$CUMULATIVE_SUM[i-1] # Normal case: take upper value
    } else {
      value=0 # If other NAME
    }
  }
  data$CUMULATIVE_SUM_REFYEAR[i] <- value # Write new value in new column
}

Using this code, my result of the above code looks like this:

   NAME  ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR
1 NAME1  47        1960           1959             -6                      0
2 NAME1  47        1961           1960            -10                     -6
3 NAME1  47        1965           1963            -23                    -10
4 NAME2 259        2007           2004             -9                      0
5 NAME2 259        2010           2009             NA                     **-9**
6 NAME2 259        2014           2011            -40                     NA

As you may have noticed when comparing it with my desired solution the value of -9 is in the wrong place (marked in bold). Is there a way how I could make this exception if there is a NA value in a row? I am stuck. Thank you for your help!

Upvotes: 0

Views: 71

Answers (2)

akrun
akrun

Reputation: 886938

Using dplyr

  library(dplyr)
  left_join(data, data %>%
  group_by(NAME) %>% 
  filter(!is.na(CUMULATIVE_SUM)) %>%
  mutate(CUMULATIVE_SUM_REFYEAR= lag(CUMULATIVE_SUM, 1, 0)))
  #      NAME  ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR
  #1 NAME1  47        1960           1959             -6                      0
  #2 NAME1  47        1961           1960            -10                     -6
  #3 NAME1  47        1965           1963            -23                    -10
  #4 NAME2 259        2007           2004             -9                      0
  #5 NAME2 259        2010           2009             NA                     NA
  #6 NAME2 259        2014           2011            -40                     -9

Upvotes: 1

David Arenburg
David Arenburg

Reputation: 92282

Try

library(data.table)
setDT(data)[!is.na(CUMULATIVE_SUM), 
            CUMULATIVE_SUM_REFYEAR := c(0, CUMULATIVE_SUM[-.N]),
            by = NAME]
data
#     NAME  ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR
# 1: NAME1  47        1960           1959             -6                      0
# 2: NAME1  47        1961           1960            -10                     -6
# 3: NAME1  47        1965           1963            -23                    -10
# 4: NAME2 259        2007           2004             -9                      0
# 5: NAME2 259        2010           2009             NA                     NA
# 6: NAME2 259        2014           2011            -40                     -9

Upvotes: 2

Related Questions