Amanda
Amanda

Reputation: 12747

How can I streamline this R script?

I have a a big dataframe in R that all looks about like this:

    name   amount   date1       date2  days_out year
    JEAN  318.5 1971-02-16 1972-11-27  650 days 1971
 GREGORY 1518.5       <NA>       <NA>   NA days 1971
    JOHN  318.5       <NA>       <NA>   NA days 1971
  EDWARD  318.5       <NA>       <NA>   NA days 1971
  WALTER  518.5 1971-07-06 1975-03-14 1347 days 1971
   BARRY 1518.5 1971-11-09 1972-02-09   92 days 1971
   LARRY  518.5 1971-09-08 1972-02-09  154 days 1971
   HARRY  318.5 1971-09-16 1972-02-09  146 days 1971
   GARRY 1018.5 1971-10-26 1972-02-09  106 days 1971

If someone's days_out is less than 60, they get a 90% discount. 60-90, a 70% discount. I need to find out the discounted sum of all the amounts for each year. My utterly embarrassing workaround is to write a python script that writes an R script that reads like this for each relevant year:

tmp <- members[members$year==1971, ]
tmp90 <- tmp[tmp$days_out <= 60  & tmp$days_out > 0  & !is.na(tmp$days_out),  ]
tmp70 <- tmp[tmp$days_out <= 90  & tmp$days_out > 60 & !is.na(tmp$days_out),  ]
tmp50 <- tmp[tmp$days_out <= 120 & tmp$days_out > 90 & !is.na(tmp$days_out),  ]
tmp30 <- tmp[tmp$days_out <= 180 & tmp$days_out >120 & !is.na(tmp$days_out),  ]
tmp00 <- tmp[tmp$days_out > 180 | is.na(tmp$days_out), ]
details.1971 <- c(1971, nrow(tmp),
  nrow(tmp90), sum(tmp90$amount), sum(tmp90$amount) * .9,
    nrow(tmp70), sum(tmp70$amount), sum(tmp70$amount) * .7,
    nrow(tmp50), sum(tmp50$amount), sum(tmp50$amount) * .5,
    nrow(tmp30), sum(tmp30$amount), sum(tmp90$amount) * .9,
    nrow(tmp00), sum(tmp00$amount))
membership.for.chart <- rbind(membership.for.chart,details.1971)

It works just fine. The tmp frames and vectors get overwritten which is fine. But I know that I've utterly defeated everything that is elegant and efficient about R here. I launched R for the first time a month ago and I think I've come a long way. But I would really like to know how I should have gone about this?

Upvotes: 1

Views: 140

Answers (2)

joran
joran

Reputation: 173577

Wow, you wrote a Python script that generates an R script? Consider my eyebrows raised...

Hopefully this will get you started:

#Import your data; add dummy column to separate 'days' suffix into its own column
dat <- read.table(text = "   name   amount   date1       date2  days_out dummy year
    JEAN  318.5 1971-02-16 1972-11-27  650 days 1971
 GREGORY 1518.5       <NA>       <NA>   NA days 1971
    JOHN  318.5       <NA>       <NA>   NA days 1971
  EDWARD  318.5       <NA>       <NA>   NA days 1971
  WALTER  518.5 1971-07-06 1975-03-14 1347 days 1971
   BARRY 1518.5 1971-11-09 1972-02-09   92 days 1971
   LARRY  518.5 1971-09-08 1972-02-09  154 days 1971
   HARRY  318.5 1971-09-16 1972-02-09  146 days 1971
   GARRY 1018.5 1971-10-26 1972-02-09  106 days 1971",header = TRUE,sep = "")

#Repeat 3 times
df <- rbind(dat,dat,dat)

#Create new year variable
df$year <- rep(1971:1973,each = nrow(dat))

#Breaks for discount levels
ct <- c(0,60,90,120,180,Inf)

#Cut into a factor
df$fac <- cut(df$days_out,ct)

#Create discount amounts for each row
df$discount <- c(0.9,0.7,0.5,0.9,1)[df$fac]
df$discount[is.na(df$discount)] <- 1

#Calc adj amount
df$amount_adj <- with(df,amount * discount)

#I use plyr a lot, but there are many, many
# alternatives
library(plyr)
ddply(df,.(year),summarise,
            amt = sum(amount_adj),
            total = length(year),
            d60 = length(which(fac == "(0,60]")))

I only calculated a few of your summary values in the last ddply command. I'm assuming you can extend it yourself.

Upvotes: 2

IRTFM
IRTFM

Reputation: 263342

You can use either the cut function or the findInterval function. The exact code will depend on the internals of the object which are not unambiguously communicated with console output. If that days_out is a difftime-object. then something like this might work:

disc_amt <- with(tmp, amount*c(.9, .7, .5, .9, 1)[
                                 findInterval(days_out, c(0, 60, 90, 120, 180, Inf] )

You should post the output of dput() on that tmp object or perhaps dput(head(tmp, 20)) if its really big, and testing can proceed. (The actual discounts did not seem to be ordered in a manner I would have expected.)

Upvotes: 2

Related Questions