Prometheus

Reputation: 693

R: How to vectorize a function that depends on other observations

Hi, I have a dataset as follows:

set.seed(100)
library(microbenchmark)
City=c("City1","City2","City2","City1","City2","City1","City2","City1")
Business=c("B","A","B","A","C","A","E","F")
SomeNumber=c(35,20,15,19,12,40,36,28)
zz=data.frame(City,Business,SomeNumber)
zz_new=do.call("rbind", replicate(1000,zz, simplify = FALSE))
zz_new$BusinessMax=0 #Initializing final variable of interest at 0

I am just duplicating the rows of dataframe zz 1000 times to measure performance later.

I also have a custom function as follows:

City1=function(full_data,observation){
  NewSet=full_data[which(full_data$City==observation$City & !full_data$Business==observation$Business),]
  NewSet2=max(NewSet$SomeNumber)
  return(NewSet2)
}

What I wish to do is apply the custom function to only those rows of zz_new where City==City1. I can create a logical object i1 which stores whether a particular row satisfies the condition as follows:

i1 <- zz_new[["City"]] == "City1"

Next, and this is where I need help, I write a for loop (which takes a very long time) as follows:

for (i in 1:nrow(zz_new[i1,])){
  zz_new[i1,][i,"BusinessMax"]=City1(full_data=zz_new, observation = zz_new[i1,][i,])
}
zz_new[i1,]

The above code provides the correct answer. However, it is extremely slow and inefficient. I run microbenchmark and obtain:

microbenchmark(
for (i in 1:nrow(zz_new[i1,])){
  zz_new[i1,][i,"BusinessMax"]=City1(full_data=zz_new, observation = zz_new[i1,][i,])
},times = 5)

      min       lq     mean   median       uq     max neval
 4.369269 4.400759 4.433388 4.401734 4.450246 4.54493     5

How should I go about vectorizing the function City1? In my actual code, I need to make multiple condition checks in City1 (here I have used just two columns, City and Business, to subset the data, but I need to include several other variables). Most of the vectorized code on SO uses information from the given row only. In my case, I need to combine information from the given row with information from the rest of the dataset. Any help will be appreciated. Thanks in advance.

Edit 1:

Description of the function City1:

First, it creates a subset that keeps only those rows of the dataset whose "City" matches the "City" of the observation provided. From this subset, it removes the rows whose "Business" matches the "Business" of the observation, and then returns the maximum of SomeNumber over the remaining rows. For example, if the "City" and "Business" of the observation provided are City1 and A respectively, the subset only contains rows with City == City1 and Business not equal to A.
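To make this concrete, here is a minimal sketch on the original eight rows (before replication), assuming the observation is row 4 (City1, Business A). The names zz_small, obs and kept are only for illustration and do not appear in my real code:

```r
# Original eight-row data, before replication (zz_small is an illustrative name)
zz_small <- data.frame(
  City = c("City1", "City2", "City2", "City1", "City2", "City1", "City2", "City1"),
  Business = c("B", "A", "B", "A", "C", "A", "E", "F"),
  SomeNumber = c(35, 20, 15, 19, 12, 40, 36, 28),
  stringsAsFactors = FALSE
)

# Observation of interest: row 4 is (City1, A, 19)
obs <- zz_small[4, ]

# Keep rows in the same city whose business differs from the observation's
kept <- zz_small[zz_small$City == obs$City & zz_small$Business != obs$Business, ]
print(kept)

# The value City1() would return for this observation
max(kept$SomeNumber)  # 35
```

Only the City1 rows for businesses B (35) and F (28) survive the filter, so the result is 35.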

I also need to create similar functions for the other cities. But if someone could help me vectorize City1, I could try to do the same for the other functions.

Edit 2:

As an example, I write an alternate function for City == City2 as follows:

City2=function(full_data,observation){
  NewSet=full_data[which(full_data$City==observation$City & full_data$Business==observation$Business),]
  NewSet2=max(NewSet$SomeNumber)-(10*rnorm(1))
  return(NewSet2)
}

In the above function, note that compared to City1, I removed the "!" operator from the NewSet condition and subtract 10*rnorm(1) from the value NewSet2.

Next, I run it only for the observations where City == City2.

i2 <- zz_new[["City"]] == "City2"

for (i in 1:nrow(zz_new[i2,])){
  zz_new[i2,][i,"BusinessMax"]=City2(full_data=zz_new, observation = zz_new[i2,][i,])
}

Upvotes: 0

Views: 72

Answers (1)

Gregor Thomas

Reputation: 145765

Here is a fast version that accomplishes what your for loop over City1() does. It seems like you wanted to do this within each city, so that is what I did.

library(data.table)
# convert to data table and set key for speed
zzdt = as.data.table(zz_new)
setkey(zzdt, City, Business)

# calculate the max for each business, by city (all cities)
biz_max = zzdt[, .(BusinessMax = max(SomeNumber)), by = .(City, Business)]
# self-join the max values and filter out where the business match
# to get the max of other businesses within the same city
other_biz_max = 
  biz_max[biz_max, on = .(City), allow.cartesian = TRUE][
    Business != i.Business,
    .(BusinessMax = max(i.BusinessMax)),
    by = .(City, Business)
  ]
# join back to the original data
result = zzdt[other_biz_max]
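
To see what the self-join step produces, here is a minimal sketch on the eight unreplicated rows from the question. The names zz_small, biz_max_small, pairs and other_small are illustrative only. The join on City pairs every business with every business in the same city, columns coming from the inner table get an `i.` prefix, and the filter then drops the pairs where a business is matched with itself:

```r
library(data.table)

# Eight unreplicated rows from the question (illustrative name)
zz_small <- data.table(
  City = c("City1", "City2", "City2", "City1", "City2", "City1", "City2", "City1"),
  Business = c("B", "A", "B", "A", "C", "A", "E", "F"),
  SomeNumber = c(35, 20, 15, 19, 12, 40, 36, 28)
)

# One row per (City, Business) with that business's own max
biz_max_small <- zz_small[, .(BusinessMax = max(SomeNumber)), by = .(City, Business)]

# All business pairs within a city; the inner table's columns become i.Business, i.BusinessMax
pairs <- biz_max_small[biz_max_small, on = .(City), allow.cartesian = TRUE]

# For each business, the largest max among the *other* businesses in its city
other_small <- pairs[
  Business != i.Business,
  .(BusinessMax = max(i.BusinessMax)),
  by = .(City, Business)
]
setorder(other_small, City, Business)
print(other_small)
```

For City1 the per-business maxima are A = 40, B = 35 and F = 28, so the "other business" maxima come out as A = 35, B = 40 and F = 40, which matches what City1() returns for rows of those businesses.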

If we want to apply this only to City == "City1", we can filter in the first step and make the final join a full join - the rest remains the same.

library(data.table)
# convert to data table and set key for speed
zzdt = as.data.table(zz_new)
setkey(zzdt, City, Business)

# calculate the max for each business in City1
biz_max = zzdt[City == "City1", .(BusinessMax = max(SomeNumber)), by = .(City, Business)]
# self-join the max values and filter out where the business match
# to get the max of other businesses within the same city
other_biz_max = 
  biz_max[biz_max, on = .(City), allow.cartesian = TRUE][
    Business != i.Business,
    .(BusinessMax = max(i.BusinessMax)),
    by = .(City, Business)
  ]
# join back to the original data
result = merge(zzdt, other_biz_max, by = c("City", "Business"), all = TRUE)
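
One thing to be aware of: because zz_new already contains a BusinessMax column initialized to 0, merge() will return two suffixed columns (BusinessMax.x and BusinessMax.y) rather than a single one. A possible alternative, sketched here on the eight unreplicated rows under illustrative names (zz_small, biz_max_small, other_small), is an update join that overwrites BusinessMax in place for the matching rows and leaves the other cities at their initial 0:

```r
library(data.table)

# Eight unreplicated rows, with the placeholder column from the question
zz_small <- data.table(
  City = c("City1", "City2", "City2", "City1", "City2", "City1", "City2", "City1"),
  Business = c("B", "A", "B", "A", "C", "A", "E", "F"),
  SomeNumber = c(35, 20, 15, 19, 12, 40, 36, 28),
  BusinessMax = 0
)

# Same two steps as above, restricted to City1
biz_max_small <- zz_small[City == "City1", .(BusinessMax = max(SomeNumber)), by = .(City, Business)]
other_small <- biz_max_small[biz_max_small, on = .(City), allow.cartesian = TRUE][
  Business != i.Business,
  .(BusinessMax = max(i.BusinessMax)),
  by = .(City, Business)
]

# Update join: assign by reference only where (City, Business) matches
zz_small[other_small, on = .(City, Business), BusinessMax := i.BusinessMax]
print(zz_small)
```

This keeps the original row order and avoids the duplicated column, at the cost of modifying the table by reference.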

On my computer, the data.table method takes 0.03 seconds and the method in your question takes 10.28 seconds, a speed-up of roughly 300x. That timing includes the data.table conversion and key setting; if you keep using data.table with that key, the rest of your code can speed up as well.
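
The City2() variant from Edit 2 is not covered above. One possible approach, shown as a sketch on the eight unreplicated rows (zz_small is an illustrative name), is a grouped assignment: the max is taken within each (City, Business) group and one random draw per row is subtracted. The individual values will not reproduce the loop's output even with the same seed, since the draws are consumed in a different order.

```r
library(data.table)
set.seed(100)

# Eight unreplicated rows, with the placeholder column from the question
zz_small <- data.table(
  City = c("City1", "City2", "City2", "City1", "City2", "City1", "City2", "City1"),
  Business = c("B", "A", "B", "A", "C", "A", "E", "F"),
  SomeNumber = c(35, 20, 15, 19, 12, 40, 36, 28),
  BusinessMax = 0
)

# Max within the same city and business, minus 10 * one normal draw per row (.N draws per group)
zz_small[City == "City2",
         BusinessMax := max(SomeNumber) - 10 * rnorm(.N),
         by = .(City, Business)]
print(zz_small)
```

Rows for other cities are left untouched, so this can be combined with a separate step for City1.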

Upvotes: 2
