kurdtc
kurdtc

Reputation: 1621

R: Transform and expand a dataframe in R

I have a sample dataframe in R that looks like this:

dat <- data.frame(NAME=c("SAMPLE1", "SAMPLE1", "SAMPLE1", "SAMPLE1", "SAMPLE2","SAMPLE2","SAMPLE2","SAMPLE2"),
                  ID=c(33,33,33,33,253,253,253,253),
                  SURVEY_YEAR=c(1959,1960,1961,1965,2002,2007,2010,2014), 
                  REFERENCE_YEAR=c(1959,1959,1960,1963,2002, 2004,2009,2011),
                  VALUE=c(0,-6,-10,-23,0,-9,NA,-40))

dat

  NAME     ID SURVEY_YEAR REFERENCE_YEAR VALUE
1 SAMPLE1  33        1959           1959     0
2 SAMPLE1  33        1960           1959    -6
3 SAMPLE1  33        1961           1960   -10
4 SAMPLE1  33        1965           1963   -23
5 SAMPLE2 253        2002           2002     0
6 SAMPLE2 253        2007           2004    -9
7 SAMPLE2 253        2010           2009    NA
8 SAMPLE2 253        2014           2011   -40

What I am trying to do is to expand and transform REFERENCE_YEAR and SURVEY_YEAR into one column YEAR so that the resulting dataframe looks like this:

NAME    ID  YEAR    VALUE
SAMPLE1 33  1959    0         # VALUE from REFERENCE_YEAR 1959
SAMPLE1 33  1959    0         # VALUE from SURVEY_YEAR 1959
--------------------------------------------------------------------------------
SAMPLE1 33  1959    0         # for REFERENCE_YEAR 1959, take previous VALUE
SAMPLE1 33  1960    -6        # VALUE from SURVEY_YEAR 1960
--------------------------------------------------------------------------------
SAMPLE1 33  1960    -6        # for REFERENCE_YEAR 1960, take previous VALUE
SAMPLE1 33  1961    -10       # VALUE from SURVEY_YEAR 1961
--------------------------------------------------------------------------------
SAMPLE1 33  1963    -10       # for REFERENCE_YEAR 1963, take previous VALUE (-10)
SAMPLE1 33  1965    -23       # VALUE from SURVEY_YEAR 1965
--------------------------------------------------------------------------------
SAMPLE2 253 2002    0         # VALUE from REFERENCE_YEAR 2002
SAMPLE2 253 2002    0         # VALUE from SURVEY_YEAR 2002
--------------------------------------------------------------------------------
SAMPLE2 253 2004    0         # for REFERENCE_YEAR 2004, take previous VALUE (0)
SAMPLE2 253 2007    -9        # VALUE taken from SURVEY_YEAR 2007
--------------------------------------------------------------------------------
SAMPLE2 253 2009    NA       # if one value is NA in a period (in this case 2009 to 2010), the whole period should be set to NA
SAMPLE2 253 2010    NA
--------------------------------------------------------------------------------
SAMPLE2 253 2011    -9       # for REFERENCE_YEAR 2011, take previous numerical VALUE (not NA,but -9)
SAMPLE2 253 2014    -40      # VALUE taken from SURVEY_YEAR 2014

Is there an easy way to do this?

EDIT: I'd like the data to be in the above structure because I'd like to plot like this (maybe its more understandable with the graphs?). Here NA values have been added where the series is discontinous (1962 in SAMPLE 1 and 2003 and 2008 in SAMPLE2). This is why the structure should be maintained like in the results window above.

enter image description here enter image description here

Upvotes: 0

Views: 146

Answers (1)

mgriebe
mgriebe

Reputation: 908

Fundamentally, your problem is one of assigning values to years using rules. It is not perfectly clear to me what those rules are, but as a start you can do something like this:

dat <- data.frame(NAME=c("SAMPLE1", "SAMPLE1", "SAMPLE1", "SAMPLE1", "SAMPLE2","SAMPLE2","SAMPLE2","SAMPLE2"),
              ID=c(33,33,33,33,253,253,253,253),
              SURVEY_YEAR=c(1959,1960,1961,1965,2002,2007,2010,2014), 
              REFERENCE_YEAR=c(1959,1959,1960,1963,2002, 2004,2009,2011),
              VALUE=c(0,-6,-10,-23,0,-9,NA,-40))

uyear=data.frame(UYEAR=unique(c(dat$SURVEY_YEAR,dat$REFERENCE_YEAR)),val=NA)
uyear<-uyear[with(uyear,order(UYEAR)),]

for(i in 1:nrow(uyear)) {
  if(uyear$UYEAR[i] %in% dat$SURVEY_YEAR) {
    uyear$val[i]=dat$VALUE[which(dat$SURVEY_YEAR==uyear$UYEAR[i])[1]]
  }else {uyear$val[i]=dat$VALUE[which(dat$REFERENCE_YEAR==uyear$UYEAR[i])[1]-1]}
}

That said, it is a bad idea to have "YEAR" mean two different things (start and end) without keeping a column that explains which is which.

Upvotes: 1

Related Questions