Geoffrey Stoel
Geoffrey Stoel

Reputation: 1320

Why is this r code so slow?

I am trying to create a dataframe based on information in another dataframe.

the first dataframe (base_mar_bop) has data like:

201301|ABC|4
201302|DEF|12

my wish is to create a data frame from this with 16 rows in it:

4 times: 201301|ABC|1
12 times: 201302|DEF|1

I have written a script that takes ages to run. To get an idea the final dataframe has around 2 million rows and the source dataframe has about 10k rows. I can not post sourcefiles for the dataframes due to confidentiality of the data.

Since it took ages to run this code, I decided to do this in PHP and it ran in under a minute and got the job done, writing it to a txt file and then importing the txt file in R.

I have no clue why R takes so long.. Is it the calling of the function? Is it the nested for loop? From my point of view there are not that many computationally intensive steps in there.

# first create an empty dataframe called base_eop that will each subscriber on a row 

identified by CED, RATEPLAN and 1
# where 1 is the count and the sum of 1 should end up with the base
base_eop <-base_mar_bop[1,]

# let's give some logical names to the columns in the df
names(base_eop) <- c('CED','RATEPLAN','BASE')


# define the function that enables us to insert a row at the bottom of the dataframe
insertRow <- function(existingDF, newrow, r) {
  existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
  existingDF[r,] <- newrow
  existingDF
}


# now loop through the eop base for march, each row contains the ced, rateplan and number of subs
# we need to insert a row for each individual sub
for (i in 1:nrow(base_mar_eop)) {
  # we go through every row in the dataframe
  for (j in 1:base_mar_eop[i,3]) {
    # we insert a row for each CED, rateplan combination and set the base value to 1
    base_eop <- insertRow(base_eop,c(base_mar_eop[i,1:2],1),nrow(base_eop)) 
  }
}

# since the dataframe was created using the first row of base_mar_bop we need to remove this first row
base_eop <- base_eop[-1,]

Upvotes: 0

Views: 341

Answers (2)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Here is one approach with data.table, though @BenBolker's timings are already awesome.

library(data.table)
DT <- data.table(d2)  ## d2 from @BenBolker's answer
out <- DT[, ID:=1:.N][rep(ID, BASE)][, `:=`(BASE=1, ID=NULL)]
out
#            CED RATEPLAN BASE
#       1:     1        A    1
#       2:     1        A    1
#       3:     1        A    1
#       4:     1        A    1
#       5:     1        A    1
#      ---                    
# 1999996: 10000        Y    1
# 1999997: 10000        Y    1
# 1999998: 10000        Y    1
# 1999999: 10000        Y    1
# 2000000: 10000        Y    1

Here, I've used compound queries to do the following:

  • Create an ID variable that is really just 1 to the number of rows in the data.table.
  • Use rep to repeat the ID variable by the corresponding BASE value.
  • Replaced all BASE values with "1" and dropped the ID variable we created earlier.

Perhaps there is a more efficient way to do this though. For example, dropping one of the compound queries should make it a little faster. Perhaps something like:

out <- DT[rep(1:nrow(DT), BASE)][, BASE:=1]

Upvotes: 4

Ben Bolker
Ben Bolker

Reputation: 226097

I haven't tried any benchmarking yet, but this approach (illustrated on your mini-example) should be much faster:

d <- data.frame(x1=c(201301,201302),x2=c("ABC","DEF"),rep=c(4,12))
with(d,data.frame(x1=rep(x1,rep),x2=rep(x2,rep),rep=1))

A slightly more realistic example, with timing:

d2 <- data.frame(CED=1:10000,RATEPLAN=rep(LETTERS[1:25],
         length.out=10000),BASE=200) 
nrow(d2) ## 10000
sum(d2$BASE)  ## 2e+06
system.time(d3 <- with(d2,
      data.frame(CED=rep(CED,BASE),RATEPLAN=rep(RATEPLAN,BASE),
              BASE=1)))
##   user  system elapsed 
## 0.244   0.860   1.117 
nrow(d3)  ## 2000000 (== 2e+06)

Upvotes: 3

Related Questions