learner

Reputation: 959

For-loop substitute for big data

I have a huge dataset containing millions of records; I am sharing just a small subset of it here.

    data<-structure(list(email_address_hash = structure(c(1L, 1L, 2L, 2L, 
    2L, 3L, 3L), .Label = c("0004eca7b8bed22aaf4b320ad602505fe9fa9d26", 
    "00198ee5364d73796e0e352f1d2576f8e8fa99db", "35c0ef2c2a804b44564fd4278a01ed25afd887f8"
    ), class = "factor"), open_time = structure(c(1L, 5L, 7L, 3L, 
    2L, 4L, 6L), .Label = c(" 04:39:24", " 06:31:24", " 07:05:23", 
    " 09:57:20", " 10:39:43", " 19:00:09", " 21:12:04"), class = "factor")), .Names = c("email_address_hash", 
    "open_time"), row.names = c(NA, -7L),  class = c( 
    "data.frame"))
    require(data.table)
    setDT(data)

This is what my data looks like:


I want to put the open_time values for every email_address_hash in front of it, in the form of a vector. I tried the approach below:

    data <- data[, .(open_times = paste(open_time, collapse = "")), by = email_address_hash]

    str(data)
    Classes ‘data.table’ and 'data.frame':  3 obs. of  2 variables:
     $ email_address_hash: Factor w/ 36231 levels "00012aec4ca3fa6f2f96cf97fc2a3440eacad30e",..: 2 16 7632
     $ open_times        : chr  " 04:39:24 10:39:43" " 21:12:04 07:05:23 06:31:24" " 09:57:20 19:00:09"
     - attr(*, ".internal.selfref")=<externalptr>

There are two things I want to resolve:

1) Remove the leading whitespace from open_times (the factor levels themselves each start with a space).

2) Treat each open_time in front of an email_address_hash individually. As shown below, the open times are currently concatenated into a single element.

Current Output

    data$open_times[1]
    [1] " 04:39:24 10:39:43"

    NROW(data$open_times[1])
    [1] 1

Desired Output

    data$open_times[1]
    [1] "04:39:24" "10:39:43"

    NROW(data$open_times[1])
    [1] 2

For a single element I can do:

    unlist(strsplit(trimws(data$open_times[1]), split = " "))

But as my data is huge, I want to avoid a for loop, since iterating over everything takes far too long. Can anyone provide a solution that is fast on big data, with millions or even billions of records? A data.table solution would be especially appreciated.
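
For reference, here is a loop-free sketch of the kind of result I am after, using a data.table list column (starting again from the original table in the dput above; untested at the full scale of my data):

    # one grouped, vectorized pass over the original table: no per-row loop
    # trimws() strips the leading space carried by the factor levels, and
    # list() stores each group's times as a character vector (a list column)
    agg <- data[, .(open_times = list(trimws(as.character(open_time)))),
                by = email_address_hash]

    agg$open_times[[1]]
    # [1] "04:39:24" "10:39:43"
    NROW(agg$open_times[[1]])
    # [1] 2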

Please let me know if anything is unclear to you.

Upvotes: 0

Views: 701

Answers (2)

Travis Heeter

Reputation: 14084

R is notoriously not good with big data; consider switching to Hadoop. With that being said, here is an article on ways to make R handle big data faster: https://www.r-bloggers.com/five-ways-to-handle-big-data-in-r/.

As far as getting a column as a vector, columns already are vectors:

    > data[[2]]
    [1] "04:39:24" "10:39:43"
    > NROW(data$open_time)
    [1] 2

Edit: Thanks to @Frank for pointing out OP was using a data table.
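
Building on that, here is a minimal sketch (assuming the aggregated open_times column from the question) that splits the whole column in one vectorized call, with no loop:

    # data is the aggregated table from the question;
    # strsplit() is vectorized over the entire column, trimws() drops the
    # leading spaces, and the result is a list column of character vectors
    data[, open_times := strsplit(trimws(open_times), " ", fixed = TRUE)]

    data$open_times[[1]]
    # [1] "04:39:24" "10:39:43"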

Upvotes: 1

Jordan Mackie

Reputation: 2406

Hadoop MapReduce might be what you need here. I've used it before for projects like counting the number of occurrences of phrases in huge collections of text. I imagine it could be repurposed for this too?
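
As a rough single-machine illustration of that map/reduce idea (base R only, run on the original 7-row table from the question; Hadoop would distribute the same key/value grouping across nodes):

    # map: each record is already a (key, value) pair: (hash, open_time)
    # reduce: collect all values per key; split() plays the shuffle/reduce role
    reduced <- split(trimws(as.character(data$open_time)), data$email_address_hash)

    reduced[["0004eca7b8bed22aaf4b320ad602505fe9fa9d26"]]
    # [1] "04:39:24" "10:39:43"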

Upvotes: 1
