Reputation: 12747
I have a a big dataframe in R that all looks about like this:
name amount date1 date2 days_out year
JEAN 318.5 1971-02-16 1972-11-27 650 days 1971
GREGORY 1518.5 <NA> <NA> NA days 1971
JOHN 318.5 <NA> <NA> NA days 1971
EDWARD 318.5 <NA> <NA> NA days 1971
WALTER 518.5 1971-07-06 1975-03-14 1347 days 1971
BARRY 1518.5 1971-11-09 1972-02-09 92 days 1971
LARRY 518.5 1971-09-08 1972-02-09 154 days 1971
HARRY 318.5 1971-09-16 1972-02-09 146 days 1971
GARRY 1018.5 1971-10-26 1972-02-09 106 days 1971
If someone's days_out is less than 60, they get a 90% discount. 60-90, a 70% discount. I need to find out the discounted sum of all the amounts for each year. My utterly embarrassing workaround is to write a python script that writes an R script that reads like this for each relevant year:
tmp <- members[members$year==1971, ]
tmp90 <- tmp[tmp$days_out <= 60 & tmp$days_out > 0 & !is.na(tmp$days_out), ]
tmp70 <- tmp[tmp$days_out <= 90 & tmp$days_out > 60 & !is.na(tmp$days_out), ]
tmp50 <- tmp[tmp$days_out <= 120 & tmp$days_out > 90 & !is.na(tmp$days_out), ]
tmp30 <- tmp[tmp$days_out <= 180 & tmp$days_out >120 & !is.na(tmp$days_out), ]
tmp00 <- tmp[tmp$days_out > 180 | is.na(tmp$days_out), ]
details.1971 <- c(1971, nrow(tmp),
nrow(tmp90), sum(tmp90$amount), sum(tmp90$amount) * .9,
nrow(tmp70), sum(tmp70$amount), sum(tmp70$amount) * .7,
nrow(tmp50), sum(tmp50$amount), sum(tmp50$amount) * .5,
nrow(tmp30), sum(tmp30$amount), sum(tmp90$amount) * .9,
nrow(tmp00), sum(tmp00$amount))
membership.for.chart <- rbind(membership.for.chart,details.1971)
It works just fine. The tmp frames and vectors get overwritten which is fine. But I know that I've utterly defeated everything that is elegant and efficient about R here. I launched R for the first time a month ago and I think I've come a long way. But I would really like to know how I should have gone about this?
Upvotes: 1
Views: 140
Reputation: 173577
Wow, you wrote a Python script that generates an R script? Consider my eyebrows raised...
Hopefully this will get you started:
#Import your data; add dummy column to separate 'days' suffix into its own column
dat <- read.table(text = " name amount date1 date2 days_out dummy year
JEAN 318.5 1971-02-16 1972-11-27 650 days 1971
GREGORY 1518.5 <NA> <NA> NA days 1971
JOHN 318.5 <NA> <NA> NA days 1971
EDWARD 318.5 <NA> <NA> NA days 1971
WALTER 518.5 1971-07-06 1975-03-14 1347 days 1971
BARRY 1518.5 1971-11-09 1972-02-09 92 days 1971
LARRY 518.5 1971-09-08 1972-02-09 154 days 1971
HARRY 318.5 1971-09-16 1972-02-09 146 days 1971
GARRY 1018.5 1971-10-26 1972-02-09 106 days 1971",header = TRUE,sep = "")
#Repeat 3 times
df <- rbind(dat,dat,dat)
#Create new year variable
df$year <- rep(1971:1973,each = nrow(dat))
#Breaks for discount levels
ct <- c(0,60,90,120,180,Inf)
#Cut into a factor
df$fac <- cut(df$days_out,ct)
#Create discount amounts for each row
df$discount <- c(0.9,0.7,0.5,0.9,1)[df$fac]
df$discount[is.na(df$discount)] <- 1
#Calc adj amount
df$amount_adj <- with(df,amount * discount)
#I use plyr a lot, but there are many, many
# alternatives
library(plyr)
ddply(df,.(year),summarise,
amt = sum(amount_adj),
total = length(year),
d60 = length(which(fac == "(0,60]")))
I only calculated a few of your summary values in the last ddply
command. I'm assuming you can extend it yourself.
Upvotes: 2
Reputation: 263342
You can use either the cut
function or the findInterval
function. The exact code will depend on the internals of the object which are not unambiguously communicated with console output. If that days_out
is a difftime-object. then something like this might work:
disc_amt <- with(tmp, amount*c(.9, .7, .5, .9, 1)[
findInterval(days_out, c(0, 60, 90, 120, 180, Inf] )
You should post the output of dput()
on that tmp
object or perhaps dput(head(tmp, 20))
if its really big, and testing can proceed. (The actual discounts did not seem to be ordered in a manner I would have expected.)
Upvotes: 2