Reputation: 907
I am trying to mutate a new column into a very large dataframe by using combinations of two variables to draw a value from another dataframe. I know I can do this without using a for loop, but the original dataframe is ~2.7 million rows and so a for loop takes prohibitively long.
Essentially, I have two dataframes like such:
df1 = data.frame(cbind(years = c(2001:2005), sites = c(1:5), var1 = rnorm(n=5)))
df1
years sites var1
1 2001 1 -0.01419947
2 2002 2 0.53729311
3 2003 3 0.89222231
4 2004 4 0.20600240
5 2005 5 0.24541548
df2 = data.frame(cbind(`2001` = rnorm(n = 5, mean = 2, sd = 1),
`2002` = rnorm(n = 5, mean = 2, sd = 1),
`2003` = rnorm(n = 5, mean = 2, sd = 1),
`2004` = rnorm(n = 5, mean = 2, sd = 1),
`2005` = rnorm(n = 5, mean = 2, sd = 1)))
colnames(df2) = c(2001:2005); rownames(df2) = c(1:5)
df2
2001 2002 2003 2004 2005
1 -0.1217767 0.6703649 3.6074038 2.3048512 3.0022530
2 2.6167986 1.7673236 0.9467254 0.9903685 1.8533297
3 0.3192424 2.2183726 0.8783466 2.7741821 0.1847018
4 2.3599459 0.5653315 3.8843616 3.3171480 2.9135520
5 1.5399871 2.8127713 1.2168152 2.1788167 2.1153329
where for df2
the column names are years, the rownames are sites, and in the actual version of df1
, each year/site combination is present many times.
I would like to end up with:
years sites var1 NewVar
1 2001 1 -0.01419947 1.322451
2 2002 2 0.53729311 3.083238
3 2003 3 0.89222231 1.106300
4 2004 4 0.20600240 2.723593
5 2005 5 0.24541548 2.4919137
Such that the variable NewVar
is made based on the combination of the values in years
and sites
which indicate the appropriate value to draw from df2.
What I want to do, is mutate
a column in df1 that, for each row, uses the combination of the site and year variables to find the appropriate value in df2
. As I mentioned before, I know this is possible to do with a for loop but on the computer I'm running on, a similar foreach loop took 6 hours running on 3 cores, so I'm really hoping for a quicker mutate version.
The only other thought for a solution I've had so far is to use a combination of indexing and a ton of ifelse
statements, but the real versions of the dataframes I'm working with have 702 unique combinations of site and years.
Any help would be greatly appreciated!
Upvotes: 1
Views: 4268
Reputation: 647
You could try to solve this using the data.table
package. It is a very fast package for larger amounts of data.
The idea is to melt df2
to long format, so each year X site value is stored in a separat line in the dataframe.
After doing so df1
and df2_long
can be joined together by the key-elements years
and sites
.
edit: you can leaf out all the data.table::
if you load the data.table
package. I just used them to indicate the data.table
functions.
set.seed(123)
df1 <- data.frame(cbind(years = c(2001:2005), sites = c(1:5), var1 = rnorm(n = 5)))
df2 <- data.frame(cbind(`2001` = rnorm(n = 5, mean = 2, sd = 1),
`2002` = rnorm(n = 5, mean = 2, sd = 1),
`2003` = rnorm(n = 5, mean = 2, sd = 1),
`2004` = rnorm(n = 5, mean = 2, sd = 1),
`2005` = rnorm(n = 5, mean = 2, sd = 1)))
colnames(df2) = c(2001:2005); rownames(df2) = c(1:5)
# helpercolum to melt the data
df2$site = rownames(df2)
# melt data and change varnames
df2_long = data.table::melt(df2, id.vars = "site")
names(df2_long) = c("sites", "years", "NewVar")
# set df1 as data.table
data.table::setDT(df1)
# set df2 as data.table and convert the factors to numerics, as @Gregor suggested
# in his post (this way you dont have to deal with common factor-struggles)
data.table::setDT(df2_long)
df2_long$sites = as.numeric(as.character(df2_long$sites))
df2_long$years = as.numeric(as.character(df2_long$years))
# set key-columns on which the join should be made
data.table::setkey(df1, years, sites)
data.table::setkey(df2_long, years, sites)
# leftjoin the data
df2_long[df1]
Thanks for your input @Gregor on rather changing factors to numeric then other way round.
Upvotes: 2
Reputation: 887028
An one-liner without reshaping and using the efficient row/column
indexing from base R
would be
df1$newvar <- df2[cbind(df1$sites, match(df1$years,names(df2)))]
df1
# years sites var1 newvar
#1 2001 1 -0.56047565 3.71506499
#2 2002 2 -0.23017749 2.35981383
#3 2003 3 1.55870831 0.03338284
#4 2004 4 0.07050839 1.27110877
#5 2005 5 0.12928774 3.25381492
set.seed(123)
df1 <- data.frame(cbind(years = c(2001:2005), sites = c(1:5), var1 = rnorm(n=5)))
df2 <- data.frame(cbind(`2001` = rnorm(n = 5, mean = 2, sd = 1),
`2002` = rnorm(n = 5, mean = 2, sd = 1),
`2003` = rnorm(n = 5, mean = 2, sd = 1),
`2004` = rnorm(n = 5, mean = 2, sd = 1),
`2005` = rnorm(n = 5, mean = 2, sd = 1)))
colnames(df2) <- 2001:2005
rownames(df2) <- 1:5
Upvotes: 0
Reputation: 4520
I think data.table
is probably better option here, however just to illustrate the same logic in tidyverse
:
library(tidyverse)
df2 %>% # pipe in df2
rowid_to_column('sites') %>% # assign rownames to 'sites'
gather(key = years, value = newVar, -sites) %>% # transworm df2 to long form
mutate(years = as.numeric(years)) %>% # convert 'years' into numeric
right_join(df1, by = c('years', 'sites')) %>% # join df1 and df2
select(years, sites, var1, newVar) # rearrange columns
# years sites var1 newVar
# 1 2001 1 -0.2324031 3.652280
# 2 2002 2 -1.6015391 4.144123
# 3 2003 3 -1.9813792 3.514144
# 4 2004 4 -0.6039213 2.334821
# 5 2005 5 0.3302109 3.416026
Upvotes: 3