Reputation: 57
I am searching for a solution to speed up my code. I'm working with a dataset of approx. 100.000 rows and am currently using a double for-loop. This, I guess, is slowing down my code.
Example data:
dt<-structure(list(name = c("Marcus", "Tina", "Jack", "George"),
address = c("Oxford Str.", "Oxford Str.", "Waterloo Sq.",
"London Str."), number = c(1, 1, 20, 15), suffix = c("a",
"a", NA, "b"), child = c("Tina", NA, "George", NA)), .Names = c("name",
"address", "number", "suffix", "child"), row.names = c(NA, -4L
), class = "data.frame")
Example DataFrame:
name address number suffix child
1 Marcus Oxford Str. 1 a Tina
2 Tina Oxford Str. 1 a
3 Jack Waterloo Sq. 20 George
4 George London Str. 15 b
I am iterating for every row to check if child lives on same address and put a "1" in new column 'Output'. It is "0" by default. Result should be:
Example result:
name address number suffix child output
1 Marcus Oxford Str. 1 a Tina 1
2 Tina Oxford Str. 1 a
3 Jack Waterloo Sq. 20 George 0
4 George London Str. 15 b
My current code:
df$output = 0
n = NROW(df)
for(i in 1:n) {
childID = df[i,5]
address = df[i,2]
number = df[i,3]
suffix = df[i,4]
for(j in 1:n) {
if((childID %in% df[j,1])&(address %in% df[j,2])&(number %in% df[j,3])
&(suffix %in% df[j,4]))
(df[i,6] = 1)
}
}
I have tried to use Rcpp with C++ code. It is working too but still quite slow. Any ideas to speed this up or should I just accept it is gonna take some time to run it?
Upvotes: 2
Views: 87
Reputation: 2528
I have implemented a data.table
solution, which for this particular dataset is slower than @digEmAll solution, but perhaps helpful nonetheless.
Additionally, I provide some small benchmark, which is not really meaningful on this small dataset, so please test it on a bigger one.
library(data.table)
name = c("Marcus", "Tina", "Jack", "George")
address = c("Oxford Str.", "Oxford Str.", "Waterloo Sq.", "London Str.")
number = c(1, 1, 20, 15)
suffix = c("a", "a", "", "b")
child = c("Tina", "", "George", "")
dt <- data.table(name
, address
,number
,suffix
,child
)
dt[, FullAddr := paste0(address, " " , number, suffix)]
dt[ FullAddr[match(child,name)] == FullAddr, output := 1 ]
dt[is.na(output), output := 0]
dt
name address number suffix child FullAddr output
1: Marcus Oxford Str. 1 a Tina Oxford Str. 1a 1
2: Tina Oxford Str. 1 a Oxford Str. 1a 0
3: Jack Waterloo Sq. 20 George Waterloo Sq. 20 0
4: George London Str. 15 b London Str. 15b 0
library(microbenchmark)
microbenchmark(
a = {dt[ FullAddr[match(child,name)] == FullAddr, output := 1 ]}
, b= {df$output = 0
n = NROW(df)
for(i in 1:n) {
childID = df[i,5]
address = df[i,2]
number = df[i,3]
suffix = df[i,4]
for(j in 1:n) {
if((childID %in% df[j,1])&(address %in% df[j,2])&(number %in% df[j,3])
&(suffix %in% df[j,4]))
(df[i,6] = 1)
}
}}
, c = df$output[fulladdr[match(df$child,df$name)] == fulladdr] <- 1
, times = 100L
)
Unit: microseconds
expr min lq mean median uq max neval cld
a 298.842 348.347 427.59415 413.6995 489.4665 903.467 100 a
b 15042.275 15494.461 17983.16735 15864.5405 16257.7130 162306.656 100 b
c 39.847 46.487 58.82731 59.1655 64.7495 165.420 100 a
Upvotes: 1
Reputation: 26823
Here a solution based on hashmap
as mentioned in the comments:
df <- read.csv(text = 'name,address,number,suffix,child
Marcus,Oxford Str.,1,a,Tina
Tina,Oxford Str.,1,a,
Jack,Waterloo Sq.,20,,George
George,London Str.,15,b,', stringsAsFactors = FALSE)
df
library(hashmap)
address <- paste(df$address, df$number, df$suffix)
name_address <- hashmap(df$name, address)
child_address <- name_address[[df$child]]
output <- as.integer(child_address == address)
output <- ifelse(is.na(output), '', as.character(output))
df$output <- output
df
Output:
> df
name address number suffix child output
1 Marcus Oxford Str. 1 a Tina 1
2 Tina Oxford Str. 1 a
3 Jack Waterloo Sq. 20 George 0
4 George London Str. 15 b
Upvotes: 1
Reputation: 57210
I would try to concatenate the addresses then use match
, like this :
# recreate your input (I put NAs where you have blanks)
DF <-
data.frame(name=c('Marcus','Tina','Jack','George'),
address=c('Oxford Str.','Oxford Str.','Waterloo Sq.','London Str.'),
number=c(1,1,20,15),
suffix=c('a','a',NA,'b'),
child=c('Tina',NA,'George',NA))
# create a single character address by concatenating address,number and suffix
fulladdr <- paste(DF$address,DF$number,DF$suffix,sep='||')
# initialize output to 0
DF$output <- 0
# set 1 where concatenated addresses match
DF$output[fulladdr[match(DF$child,DF$name)] == fulladdr] <- 1
> DF
name address number suffix child output
1 Marcus Oxford Str. 1 a Tina 1
2 Tina Oxford Str. 1 a <NA> 0
3 Jack Waterloo Sq. 20 <NA> George 0
4 George London Str. 15 b <NA> 0
Upvotes: 3