santoku
santoku

Reputation: 3427

R loop through columns of a data frame to create new cols based on start end years

I have a data frame of leases with start/end years like this

region=c("a","b","c","d")
lease=c("x","y","z","k")
startyr=c(2000,2001,2003,2002)
endyr=c(2004,2004,2006,2005)
annualAmt=c(7000,8500,6000,5500)
df=data.frame(region,lease,startyr,endyr,annualAmt)

I want to spread out the years by reshaping the data frame into this desired output:

region lease 2000 2001 2002 ... 2006
a x 7000 7000 7000 7000 7000 0 0
b y 0 8500 8500 8500 8500 0 0 

The logic is that if a lease covers year 2000-2004, its amt will count into the 2000,2001..2004 col

What's the best way to do it? If I write a loop, how should iI name the newly created year cols 2000-2006? Or should I use apply?

Upvotes: 1

Views: 282

Answers (2)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Here's an alternative that mostly involves basic addition and subtraction:

Rows <- df$endyr - df$startyr               # How many times to repeat rows?
df <- df[rep(rownames(df), Rows), ]         # Repeat the rows
df$year <- df$startyr + sequence(Rows) - 1  # Add a new "year" variable
reshape(df, direction = "wide",             # Reshape, long to wide
        idvar = c("region", "lease"),       # idvars are the first two cols
        timevar = "year",                   # timevar is the new year col
        drop = c("startyr", "endyr"))       # and drop the start/endyr cols
#   region lease annualAmt.2000 annualAmt.2001 annualAmt.2002
# 1      a     x           7000           7000           7000
# 2      b     y             NA           8500           8500
# 3      c     z             NA             NA             NA
# 4      d     k             NA             NA           5500
#   annualAmt.2003 annualAmt.2004 annualAmt.2005
# 1           7000             NA             NA
# 2           8500             NA             NA
# 3           6000           6000           6000
# 4           5500           5500             NA

Alternatively, you can use "data.table", like this:

library(data.table)
## Start with your original df
dt <- data.table(df)
dcast.data.table(
  DT[, list(year = seq(startyr, endyr), 
            annualAmt), 
     by = list(region, lease)], 
  region + lease ~ year, 
  value.var = "annualAmt", fill = 0)
#    region lease 2000 2001 2002 2003 2004 2005 2006
# 1:      a     x 7000 7000 7000 7000 7000    0    0
# 2:      b     y    0 8500 8500 8500 8500    0    0
# 3:      c     z    0    0    0 6000 6000 6000 6000
# 4:      d     k    0    0 5500 5500 5500 5500    0

Upvotes: 1

MrFlick
MrFlick

Reputation: 206197

How about

years <- seq(min(df$startyr), max(df$endyr))

dd <- data.frame(region, lease, t(mapply(function(a,b, v) {
    v* !is.na(match(years, seq(a, b)))
}, startyr, endyr, annualAmt)))

names(dd)[-(1:2)]<-years
dd

which returns

  region lease 2000 2001 2002 2003 2004 2005 2006
1      a     x 7000 7000 7000 7000 7000    0    0
2      b     y    0 8500 8500 8500 8500    0    0
3      c     z    0    0    0 6000 6000 6000 6000
4      d     k    0    0 5500 5500 5500 5500    0

Upvotes: 0

Related Questions