Reputation: 1119
I'm trying to find out the most efficient way of joining data from one dataframe into another. The idea is that I have a master data set (df) and a secondary dataset (lookup). I want to append the the data in the lookup table to the master data set.
Theoretical data as follows:
COLUMN_A <- 1:5
COLUMN_B <- 1:5
LOOKUP_COL <- letters[1:5]
df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL)
COLUMN_A COLUMN_B LOOKUP_COL
1 1 1 a
2 2 2 b
3 3 3 c
4 4 4 d
5 5 5 e
COLUMN_A <- 2*(1:5)
LOOKUP_COL <- letters[1:5]
SPARE_COL <- runif(5)
lookup <- data.frame(COLUMN_A,LOOKUP_COL,SPARE_COL)
COLUMN_A LOOKUP_COL SPARE_COL
1 1 a 0.6113499
2 2 b 0.3712987
3 3 c 0.3551038
4 4 d 0.6650248
5 5 e 0.2680611
This is how I've been doing it so far:
results <- merge(df,lookup,by='LOOKUP_COL')
Which provides me with:
LOOKUP_COL COLUMN_A.x COLUMN_B COLUMN_A.y SPARE_COL
1 a 1 1 1 0.6113499
2 b 2 2 2 0.3712987
3 c 3 3 3 0.3551038
4 d 4 4 4 0.6650248
5 e 5 5 5 0.2680611
So it seems that the entire lookup table has been merged into the master data, SPARE_COL is surplus to requirements - how can I control what columns get passed into the master data? Essentially, I'm trying to understand how the functionality of an excel vlookup can be used in R.
thanks
Upvotes: 2
Views: 132
Reputation: 108573
EDIT: This one uses SPARE_COL as the one to keep instead of COLUMN_A. If you have columns with the same name in different dataframes, the solution with indices will require that you rename them in one of the data frames before merging everything together.
You can do this by passing only the columns you want to merge to the function merge
. Obviously you have to keep the columns used for merging in your selection. Taking your example, this becomes:
keep <- c('LOOKUP_COL','SPARE_COL')
results <- merge(df,lookup[keep],by='LOOKUP_COL')
And the result is
> results
LOOKUP_COL COLUMN_A COLUMN_B SPARE_COL
1 a 1 1 0.75670441
2 b 2 2 0.52122950
3 c 3 3 0.99338019
4 d 4 4 0.71904088
5 e 5 5 0.05405722
By selecting the columns first, you make merge
work faster and you don't have to bother about finding the columns you want after the merge.
If speed is an issue and the merge is simple, you can speed things up by manually doing the merge using indices:
id <- match(df$LOOKUP_COL, lookup$LOOKUP_COL)
keep <- c('SPARE_COL')
results <- df
results[keep] <- lookup[id,keep, drop = FALSE]
This gives the same result, and gives a good speedup.
Let's create an example with 2 lookup columns first:
N <- 10000
COLUMN_A <- 1:N
COLUMN_B <- 1:N
LOOKUP_COL <- sample(letters[3:7], N, replace = TRUE)
LOOKUP_2 <- sample(letters[10:14], N, replace = TRUE)
df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL, LOOKUP_2)
COLUMN_A <- 2*(1:36)
LOOKUP_COL <- rep(letters[1:6], each = 6)
LOOKUP_2 <- rep(letters[10:15], times = 6)
SPARE_COL <- runif(36)
lookup <- data.frame(COLUMN_A,LOOKUP_COL, LOOKUP_2, SPARE_COL)
You can use merge again like this:
keep <- c('LOOKUP_COL','SPARE_COL', 'LOOKUP_2')
results <- merge(df,lookup[keep],by=c('LOOKUP_COL', 'LOOKUP_2'))
And you can use indices again. Before you match, you have to create the interaction between the lookup columns. You can do this using the function
interaction()
for any number of lookup columns:
lookups <- c('LOOKUP_COL','LOOKUP_2')
id <- match(interaction(df[lookups]),
interaction(lookup[lookups]))
keep <- c('SPARE_COL')
results <- df
results[keep] <- lookup[id,keep, drop = FALSE]
In the test below the speedup is about a 6-fold for the two-column case:
test replications elapsed relative user.self sys.self user.child
1 code1() 100 6.30 6.117 6.30 0 NA
2 code2() 100 1.03 1.000 1.03 0 NA
sys.child
1 NA
2 NA
The code for testing:
N <- 10000
COLUMN_A <- 1:N
COLUMN_B <- 1:N
LOOKUP_COL <- sample(letters[3:7], N, replace = TRUE)
LOOKUP_2 <- sample(letters[10:14], N, replace = TRUE)
df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL, LOOKUP_2)
COLUMN_A <- 2*(1:36)
LOOKUP_COL <- rep(letters[1:6], each = 6)
LOOKUP_2 <- rep(letters[10:15], times = 6)
SPARE_COL <- runif(36)
lookup <- data.frame(COLUMN_A,LOOKUP_COL, LOOKUP_2, SPARE_COL)
code1 <- function(){
keep <- c('LOOKUP_COL','SPARE_COL', 'LOOKUP_2')
results <- merge(df,lookup[keep],by=c('LOOKUP_COL', 'LOOKUP_2'))
}
code2 <- function(){
lookups <- c('LOOKUP_COL','LOOKUP_2')
id <- match(interaction(df[lookups]),
interaction(lookup[lookups]))
keep <- c('SPARE_COL')
results <- df
results[keep] <- lookup[id,keep, drop = FALSE]
}
require(rbenchmark)
benchmark(code1(),code2())
Upvotes: 1