user7328626
user7328626

Reputation: 11

Adding new column using custom function in data frame using dplyr/data.table in R

I'm relatively new to R programing and am trying to figure out how to use custom functions to evaluate new columns of a data frame using dplyr or data.table in a memory efficient manner. Can someone please help

Here is a brief summary of my problem

Data frames 1 and 2 have the same type and number of columns

df1 <- data.frame(col1 = c("A", "B", "C"), col2 = c(10,20,30))
df2 <- data.frame(col1 = c("DA", "EE", "FB", "C"), col2 = c(10,20,30,40))

These data frames have millions of records.

Now I want to add a new column to one of the data frames (say df1) by using the values in df2.

library(dplyr)

calculateCol3 <- function(word) {
df2 %>%
    filter(grepl(paste0(word, "$"),col1) )%>%
    summarize(col3= sum(col2))
col3
}

df1 %>% group_by(col1) %>% mutate(col3 = calcualteCol3(col1))

This method works but it is painfully slow and I guess this is because of copying the data sets too many times. Can someone suggest a better way of doing the same? The expected result is:

col1 col2 col3
   A   10   10
   B   20   30
   C   30   40 

I also tried converting the data frames to data.table as follows

dt1 <- data.table(df1)
dt2 <- data.table(df2)

dt1[, col3 := calculateCol3(col1)}, by = 1:nrow(dt1)]

Everything seems to be slow. Am sure there is a better way to achieve this. Can someone help

Thanks

Upvotes: 1

Views: 742

Answers (2)

David Arenburg
David Arenburg

Reputation: 92300

If you want an efficient solution I would suggest you won't use regex and don't do by-row operations. If all your function is doing is to join by the last letter, you could just get that latter without using regex and then do a binary join using data.table (for efficiency)

library(data.table)
setDT(df2)[, EndWith := substring(col1, nchar(as.character(col1)))]
setDT(df1)[df2, col3 := i.col2, on = .(col1 = EndWith)]
df1
#    col1 col2 col3
# 1:    A   10   10
# 2:    B   20   30
# 3:    C   30   40

Now, by looking at your function, it seems like you also trying to sum the values in df2$col2 per join. No problem, you can run functions while doing a binary join in data.table too. Lets say this is your df2 (just to illustrate when you have more than a single value per last letter)

df2 <- data.frame(col1 = c("DA", "FA", "EE", "FB", "C", "fC"), col2 = c(10,20,10,30,40,30))
df2
#   col1 col2
# 1   DA   10
# 2   FA   20
# 3   EE   10
# 4   FB   30
# 5    C   40
# 6   fC   30

The first step is the same

setDT(df2)[, EndWith := substring(col1, nchar(as.character(col1)))]

While the second step will involve a binary join- just to the opposite way, while adding , by = .EACHI and specifying your desired function

setDT(df2)[df1, .(col2 = i.col2, col3 = sum(col2)), on = .(EndWith = col1), by = .EACHI]
#    EndWith col2 col3
# 1:       A   10   30
# 2:       B   20   30
# 3:       C   30   70

Upvotes: 3

thelatemail
thelatemail

Reputation: 93938

Using the fuzzyjoin package, I think you can make this work. E.g.:

#install.packages("fuzzyjoin")
df1$col1regex <- paste0(df1$col1,"$")
regex_join(df2, df1, by=c(col1="col1regex"), mode="right")

#  col1.x col2.x col1.y col2.y col1regex
#1     DA     10      A     10        A$
#2     FB     30      B     20        B$
#3      C     40      C     30        C$

Upvotes: 0

Related Questions