Reputation: 1320
I am trying to create a dataframe based on information in another dataframe.
The first dataframe (base_mar_bop) has data like:
201301|ABC|4
201302|DEF|12
My wish is to create a dataframe from this with 16 rows in it:
4 times: 201301|ABC|1
12 times: 201302|DEF|1
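In R terms (just a sketch of the example above, using the column names CED, RATEPLAN and BASE from my script below, since I cannot share the real data), the input looks like:
# sketch of the example input (real data is confidential)
base_mar_bop <- data.frame(CED = c(201301, 201302),
                           RATEPLAN = c("ABC", "DEF"),
                           BASE = c(4, 12))
# desired result: 4 rows of 201301|ABC|1 followed by 12 rows of 201302|DEF|1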
I have written a script that takes ages to run. To give an idea of the scale, the final dataframe has around 2 million rows and the source dataframe has about 10k rows. I cannot post the source files for the dataframes due to the confidentiality of the data.
Since this code took ages to run, I decided to do the job in PHP instead; it finished in under a minute, writing the result to a txt file that I then imported into R.
I have no clue why R takes so long. Is it the function call? Is it the nested for loop? From my point of view, there are not that many computationally intensive steps in there.
# first create a dataframe called base_eop that will hold each subscriber on a row,
# identified by CED, RATEPLAN and 1,
# where 1 is the count and the sum of the 1s should add up to the base
base_eop <- base_mar_bop[1,]
# let's give some logical names to the columns in the df
names(base_eop) <- c('CED','RATEPLAN','BASE')
# define the function that enables us to insert a row at the bottom of the dataframe
insertRow <- function(existingDF, newrow, r) {
  existingDF[seq(r + 1, nrow(existingDF) + 1), ] <- existingDF[seq(r, nrow(existingDF)), ]
  existingDF[r, ] <- newrow
  existingDF
}
# now loop through base_mar_bop; each row contains the CED, rateplan and number of subs
# we need to insert a row for each individual sub
for (i in 1:nrow(base_mar_bop)) {
  # we go through every row in the dataframe
  for (j in 1:base_mar_bop[i, 3]) {
    # we insert a row for each CED, rateplan combination and set the base value to 1
    base_eop <- insertRow(base_eop, c(base_mar_bop[i, 1:2], 1), nrow(base_eop))
  }
}
# since the dataframe was created using the first row of base_mar_bop we need to remove this first row
base_eop <- base_eop[-1,]
Upvotes: 0
Views: 341
Reputation: 193517
Here is one approach with data.table, though @BenBolker's timings are already awesome.
library(data.table)
DT <- data.table(d2) ## d2 from @BenBolker's answer
out <- DT[, ID:=1:.N][rep(ID, BASE)][, `:=`(BASE=1, ID=NULL)]
out
# CED RATEPLAN BASE
# 1: 1 A 1
# 2: 1 A 1
# 3: 1 A 1
# 4: 1 A 1
# 5: 1 A 1
# ---
# 1999996: 10000 Y 1
# 1999997: 10000 Y 1
# 1999998: 10000 Y 1
# 1999999: 10000 Y 1
# 2000000: 10000 Y 1
Here, I've used compound queries to do the following: convert d2 to a data.table, add an ID column (1:.N), and use rep to repeat the ID variable by the corresponding BASE value; BASE is then reset to 1 and the helper ID column dropped.
Perhaps there is a more efficient way to do this though. For example, dropping one of the compound queries should make it a little faster. Perhaps something like:
out <- DT[rep(1:nrow(DT), BASE)][, BASE:=1]
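As a quick sanity check (just a sketch, reusing d2 from @BenBolker's answer), the expanded table should again have one row per subscriber:
nrow(out)                  ## 2000000
nrow(out) == sum(d2$BASE)  ## TRUE
all(out$BASE == 1)         ## TRUE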
Upvotes: 4
Reputation: 226097
I haven't tried any benchmarking yet, but this approach (illustrated on your mini-example) should be much faster:
d <- data.frame(x1=c(201301,201302),x2=c("ABC","DEF"),rep=c(4,12))
with(d,data.frame(x1=rep(x1,rep),x2=rep(x2,rep),rep=1))
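As a quick check (out_small is just a throwaway name for this sketch), that produces exactly the 16 rows asked for:
out_small <- with(d, data.frame(x1 = rep(x1, rep), x2 = rep(x2, rep), rep = 1))
nrow(out_small)       ## 16
table(out_small$x2)   ## ABC: 4, DEF: 12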
A slightly more realistic example, with timing:
d2 <- data.frame(CED=1:10000,RATEPLAN=rep(LETTERS[1:25],
length.out=10000),BASE=200)
nrow(d2) ## 10000
sum(d2$BASE) ## 2e+06
system.time(d3 <- with(d2,
data.frame(CED=rep(CED,BASE),RATEPLAN=rep(RATEPLAN,BASE),
BASE=1)))
## user system elapsed
## 0.244 0.860 1.117
nrow(d3) ## 2000000 (== 2e+06)
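An equivalent base-R idiom (a sketch, not benchmarked here) indexes the rows directly and then overwrites BASE:
d3b <- d2[rep(seq_len(nrow(d2)), d2$BASE), ]
d3b$BASE <- 1
nrow(d3b)  ## 2000000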
Upvotes: 3