Reputation: 556
I am trying to make a column containing full texts by iterating over elements in few columns of the data.table
. here is my current approach. It is working as I intended however, when the data.table
gets big, I get immense time waste.
library(data.table)
new_df <- data.table(text= c("RT A y...", "RT b...", "XYZ 3...", "RT Ca...", "IO"),
full_text= c(NA, NA, "XYZ 378978978", NA, NA),
status.text= c("A yes y...", "ball ball", NA, "Call ca...", NA),
status.full_text= c("A yes yes yes yes", NA, NA, "Call call call", NA))
# text full_text status.text status.full_text
# 1: RT A y... <NA> A yes y... A yes yes yes yes
# 2: RT b... <NA> ball ball <NA>
# 3: XYZ 3... XYZ 378978978 <NA> <NA>
# 4: RT Ca... <NA> Call ca... Call call call
# 5: IO <NA> <NA> <NA>
#
attach_texts_in_df <- function(give_me_df){
#make an empty vector to store texts
complete_texts <- c()
#loop through each elements of rows
for(i in seq_along(1:nrow(give_me_df))){
#check if text begins with RT
if(!grepl('^RT', give_me_df[i, "text"])){
#check if text is smaller than the full_text, while full text is not NA
if((nchar(give_me_df[i, "text"]) < nchar(give_me_df[i, "full_text"]))& !is.na(give_me_df[i, "full_text"])){
complete_texts <- c(complete_texts, give_me_df[i, "full_text"])
}else{
complete_texts <- c(complete_texts, give_me_df[i, "text"]) # if not, then it's original
}
}
else{
if((nchar(give_me_df[i, "status.text"]) < nchar(give_me_df[i, "status.full_text"]))& !is.na(give_me_df[i, "status.full_text"])){
complete_texts <- c(complete_texts, give_me_df[i, "status.full_text"])
}else{
complete_texts <- c(complete_texts, give_me_df[i, "status.text"])
}
}
}
#attached the proper texts
give_me_df$complete_text <- complete_texts
#return the vector
return(give_me_df)
}
new_df <- attach_texts_in_df(new_df)
#this was the what I was looking for and I got it when its small, but big one take a long time!!
# text full_text status.text status.full_text complete_text
# 1: RT A y... <NA> A yes y... A yes yes yes yes A yes yes yes yes
# 2: RT b... <NA> ball ball <NA> ball ball
# 3: XYZ 3... XYZ 378978978 <NA> <NA> XYZ 378978978
# 4: RT Ca... <NA> Call ca... Call call call Call call call
# 5: IO <NA> <NA> <NA> IO
I was wondering if someone can help me optimize this. R is new for me. I know apply functions exist, but I don't know how to work with those with custom functions like these.
I will appreciate your assistance and hint. Thank you.
EDIT: I did the following using data.table
function, however I am missing some data:
sample_fxn <- function(t,ft,st,sft){
if(!grepl('^RT', t)){
if((nchar(t) < nchar(ft)) & !is.na(ft)){
return(ft)
}else{
return(t)
}
}
else{
if((nchar(st) < nchar(sft))& !is.na(sft)){
return(sft)
}else{
return(st)
}
}
}
new_df <- new_df[ ,complete_texts := sample_fxn(text,
full_text,
status.text,
status.full_text)]
# text full_text status.text status.full_text complete_texts
# 1: RT A y... <NA> A yes y... A yes yes yes yes A yes yes yes yes
# 2: RT b... <NA> ball ball <NA> <NA>
# 3: XYZ 3... XYZ 378978978 <NA> <NA> <NA>
# 4: RT Ca... <NA> Call ca... Call call call Call call call
# 5: IO <NA> <NA> <NA> <NA>
This is my best attempt after reading the vectorized version from that R Inferno book which @Henrik shared. I came up with:
new_df$complete_texts <- ifelse(!grepl('^RT', new_df$text),
yes = ifelse((nchar(new_df$text) < nchar(new_df$full_text))& !is.na(new_df$full_text),
yes = new_df$full_text,
no = new_df$text
),
no = ifelse((nchar(new_df$status.text) < nchar(new_df$status.full_text))& !is.na(new_df$status.full_text),
yes = new_df$status.full_text,
no = new_df$status.text
)
)
This does gets the job done 3X faster. I was wondering if somebody can explain me better ways. I want to learn.
Upvotes: 3
Views: 180
Reputation: 34703
Make sure to read some intro material for data.table
-- in particular Introduction and Reference Semantics vignettes.
Next, the most glaring thing I see is a lack of vectorization. In low-level languages, you have to loop all the time; in R, you should always think -- do I really need a loop here? In your code, I see a scalar version of several vectorized functions being used:
grepl
works on vectorsnchar
works on vectorsis.na
works on vectorsWhenever possible, you should use the vector version -- there is some latency of calling a C function repeatedly compared to just calling it once:
for (i in 1:nrow(DT)) grepl('^RT', DT[i, "text"])
is guaranteed slower than grepl('^RT', DT$text)
Next, data.table
has some further overhead in calling [
repeatedly, because there is a lot going on internally inside [
to handle more complex "queries", so you should try and take advantage of that when possible!
Lastly, rather than updating the data.table
in your function, I would rather have the function return a vector that I can assign as a column -- the end goal is to have something like:
new_df[ , complete_text := my_function(.SD)]
Note that my_function(.SD)
is the same as my_function(new_df)
in this simple case -- the use of .SD
here is to get used to this syntax in more complex scenarios; see the .SD
vignette for more.
Here's my updated version of what I'll call get_complete_text
:
get_complete_text = function(DT) {
DT[ , fifelse(
grepl('^RT', text),
fifelse(
nchar(status.text) < nchar(status.full_text) & !is.na(status.full_text),
status.full_text,
status.text
),
fifelse(
nchar(text) < nchar(full_text) & !is.na(full_text),
full_text,
text
)
)]
}
new_df[ , complete_text := get_complete_text(.SD)][]
# text full_text status.text status.full_text complete_text
# 1: RT A y... <NA> A yes y... A yes yes yes yes A yes yes yes yes
# 2: RT b... <NA> ball ball <NA> ball ball
# 3: XYZ 3... XYZ 378978978 <NA> <NA> XYZ 378978978
# 4: RT Ca... <NA> Call ca... Call call call Call call call
# 5: IO <NA> <NA> <NA> IO
The nested fifelse
could be simplified with an intermediate column storing which text
columns to use based on the ^RT
condition:
idx = new_df[grepl('^RT', text), which=TRUE]
new_df[idx, c('rt_text', 'rt_full_text') := .(status.text, status.full_text)]
new_df[-idx, c('rt_text', 'rt_full_text') := .(text, full_text)]
new_df[ , complete_text :=
fifelse(
nchar(rt_text) < nchar(rt_full_text) & !is.na(rt_full_text),
rt_full_text,
rt_text
)
]
Alternatively, with the development version of data.table
there is fcase
, which you may find more readable (in this case, I think the nested fifelse
works fine):
get_complete_text = function(DT) {
DT[ , fcase(
grepl('^RT', text) & nchar(status.text) < nchar(status.full_text) & !is.na(status.full_text),
status.full_text,
grepl('^RT', text) & (nchar(status.full_text) >= nchar(status.text) | is.na(status.full_text)),
status.text,
# here, we're implicitly using that logically grepl('^RT') is now FALSE
nchar(text) < nchar(full_text) & !is.na(full_text),
full_text,
# there is some ongoing work to make this less ugly,
# but for now this is the catchall term -- we could also
# explicitly write out the conjugate condition nchar(full_text) >= ...
rep(TRUE, .N),
text
)]
}
Upvotes: 10