Rozgonyi
Rozgonyi

Reputation: 1069

Alternative to for-loop to fill data.frame by unique rows

I am trying to make a script to generate a random set of people with demographic information using R. I want it to generate by rows and not columns so that a function can be based on the outcome of the previous function in the same row. I know this can be done with a for loop (as I did below) but for loops are extremely slow in R. I have read that you can use apply or while to do a loop much more efficiently but I haven't figured out how despite many failed attempts. A sample of the functional code with the loop is below. How would I do that with apply or while?

y <- 1980 ## MedianYr
d <- 0.1 ## Rate of NA responses

AgeFn <- function(y){
  Year <- 1900 + as.POSIXlt(Sys.Date())$year
  RNormYr <- as.integer((rnorm(1)*10+y))
  Age <- Year - RNormYr
}

EduByAge <- function (Age, d) {
  ifelse(Age < 17, sample(c("Some High School",NA), size=1,prob=c((1-d),d)),
    ifelse(Age > 16 & Age < 19, sample(c("Some High School", "High School Grad",NA), size=1, prob=c(0.085, 0.604,d)),
      ifelse(Age > 18 & Age < 21, sample(c("Some High School", "High School Grad", "Associates",NA), size=1,prob=c(0.085, 0.25, 0.354,d)),
        ifelse(20 > Age & Age < 23, sample(c("Some High School", "High School Grad", "Associates", "Bachelors",NA), size=1,prob=c(0.085, 0.25, 0.075, 0.279,d)),
          ifelse(Age > 22, sample(c("Some High School", "High School Grad", "Associates", "Bachelors", "Masters", "Professional", "Doctorate",NA),size=1,prob=c(0.085, 0.25, 0.075, 0.176, 0.072, 0.019, 0.012,d)), NA)))))
}

GenderFn <- function(d){
   Gender1 <- sample(c("Male","Female","Trans", NA), 1, replace=TRUE, prob=c(0.49, 0.5, 0.01, d))
   return(Gender1)
}

UserGen <- function(n,s) {
  set.seed(s)  
  Rows <- function(y,d){
    Age <- abs(AgeFn(y))
    Gender <- GenderFn(d)
    Education <- EduByAge(Age,d)
    c(i, Age, Gender, Education)
  } 
  df <- data.frame(matrix(NA, ncol = 4, nrow = n))
  for(i in (1:n)) {
    df[i,] <- Rows(y,d)
  }
  colnames(df) <- c("ID", "Age", "Gender", "Education")
  return(df)
}

Upvotes: 1

Views: 1009

Answers (3)

TARehman
TARehman

Reputation: 6749

So, the way you have written your code means that you will end up at least one loop.

apply is used to apply a function to each of the elements of another structure. So, it will work when you want to pass the vector containing all the ages to the other functions. It, however, isn't so hot for running the AgeFn() function that you have, because that doesn't take as an argument anything over which you would want to iterate.

Here's an alternative possibility, which ditches your method of getting random ages in favor of the sample function. I made a few assumptions, but I'm hoping that the explanation helps you figure out how this all works in R.

y <- 1980       ## MedianYr
d <- 0.1        ## Rate of NA responses
agemin <- 14
agemax <- 90

# The stats guy in me thinks that you might have some
# methodological problems here with how the ages are assigned
# But I'm just going to stick with it for now
EduByAge <- function (Age, d) {
    ifelse(Age < 17, sample(c("Some High School",NA), size=1,prob=c((1-d),d)),
           ifelse(Age > 16 & Age < 19, sample(c("Some High School", "High School Grad",NA), size=1, prob=c(0.085, 0.604,d)),
                  ifelse(Age > 18 & Age < 21, sample(c("Some High School", "High School Grad", "Associates",NA), size=1,prob=c(0.085, 0.25, 0.354,d)),
                         ifelse(20 > Age & Age < 23, sample(c("Some High School", "High School Grad", "Associates", "Bachelors",NA), size=1,prob=c(0.085, 0.25, 0.075, 0.279,d)),
                                ifelse(Age > 22, sample(c("Some High School", "High School Grad", "Associates", "Bachelors", "Masters", "Professional", "Doctorate",NA),size=1,prob=c(0.085, 0.25, 0.075, 0.176, 0.072, 0.019, 0.012,d)), NA)))))
}

NewUserGen <- function(n,s) {

    set.seed(s)

    ## Start by creating a data frame with IDs
    fakedata <- data.frame(ID=1:n)

    # Rather than a function, here I just used the built-in sample function
    # I am sampling for n ages lying between agemin and agemax
    # Using dnorm(), I assume a normal distribution of the ages, with
    # mean age equal to today's year minus the "MedianYr" you were using above
    # I assume that the mean and the SD are equal, you don't have to do that

    # I put in a few extra carriage returns here to make things not quite so
    # tight together - figured it would be easier to read.
    fakedata$Age <- sample(x=agemin:agemax,size=n,replace=TRUE,
                           prob=
                           dnorm(agemin:agemax,
                           mean=abs(y-as.numeric(format.Date(Sys.Date(),"%Y"))),
                           sd=abs(y-as.numeric(format.Date(Sys.Date(),"%Y")))))

    # I'm sure you know this, but you have some issues here
    # namely that you have a probability vector that totals to more than 1.
    # You might be getting no NAs as a result.
    fakedata$Gender <- sample(c("Male","Female","Trans", NA), 
                              n, replace=TRUE, prob=c(0.49, 0.5, 0.01, d))

    # Here is the actually sapply()
    fakedata$Edu <- sapply(fakedata$Age,FUN=EduByAge,d=0.1)

    return(fakedata)
}

outdata <- NewUserGen(300,10201)

This is how the data looks in aggregate afterward:

outdata$Edu <- factor(outdata$Edu,levels=c("Some High School",
                                           "High School Grad",
                                           "Associates",
                                           "Bachelors",
                                           "Masters",
                                           "Doctorate"),ordered=TRUE)

hist(outdata$Age)
barplot(table((outdata$Gender)))
par(mai=c(3,1,1,1))
barplot(table((outdata$Edu)),las=2)

Edu Distribution Gender Distribution Age Histogram

Upvotes: 1

kith
kith

Reputation: 5564

I'd modify the Rows function to take in an ID, rather than using the scoped "i".

Rows <- function(i, y,d){
    Age <- abs(AgeFn(y))
    Gender <- GenderFn(d)
    Education <- EduByAge(Age,d)
    c(i, Age, Gender, Education)
} 

Then you can call your function with lapply:

res1 = lapply(1:3000, function(i){
    Rows(i, y, d)
})

This alone doesn't really improve the speed, but if you're on a machine with multiple cores, you might be able to get some use out of the "multicore" library, with its mclapply function.

library("multicore")
res2 = mclapply(1:3000, function(i){
    Rows(i, y,d)
}) 

Oh, and if you want to use the results as a dataframe you can do:

df = data.frame(do.call(rbind, res1))

Upvotes: 0

Blue Magister
Blue Magister

Reputation: 13363

For the main function, you can use something from the apply family of functions, namely replicate. The speed improvement comes from the fact that R is an assign-by-copy language, and the for loop needlessly copies your data frame:

UserGen2 <- function(n,s) {
  set.seed(s)  
  Rows <- function(y,d) {
    Age <- abs(AgeFn(y))
    Gender <- GenderFn(d)
    Education <- EduByAge(Age,d)
    c(Age, Gender, Education)
  } 
  samp <- t(replicate(n,Rows(y,d)))
  colnames(samp) <- c("Age","Gender","Education")
  data.frame(ID=seq_len(dim(samp)[1]),samp)
}

There are probably other improvements you can do as well.

Upvotes: 0

Related Questions