Javier

Reputation: 1550

A fast way to merge named vectors of different length into a data frame (preserving name information as column name) in R

I have a list L of named vectors. For example, the first element:

> L[[1]]
$event
[1] "EventA"

$time
[1] "1416355303"

$city
[1] "Los Angeles"

$region
[1] "California"

$Locale
[1] "en-GB"

When I unlist each element of the list, the resulting vectors look like this (for the first 3 elements):

> unlist(L[[1]])
    event          time          city        region        Locale 
 "EventA"  "1416355303" "Los Angeles"  "California"       "en-GB" 

> unlist(L[[2]])
   event         time       Locale 
"EventB" "1416417567"      "en-GB" 

> unlist(L[[3]])
   event properties.time 
 "EventM"    "1416417569" 

I have over 0.5 million elements in the list, and each one has up to 42 of these features/names. I have to merge them into a data frame, taking into account their names and the fact that not all of them have the same number of features or names (in the example above, the second element has no information for region and city). At the moment, I loop through the whole list:

df1 <- merge(stack(unlist(L[[1]])), stack(unlist(L[[2]])),
        by = "ind", all = TRUE)
suppressWarnings(for (i in 3:length(L)){
    df1 <- merge(df1, stack(unlist(L[[i]])), by = "ind", all = TRUE)
})
df1 <- as.data.frame(t(df1))

For the example above this returns:

                 V1     V2     V3         V4         V5
 ind             city  event Locale     region       time
 values.x Los Angeles EventA  en-GB California 1416355303
 values.y        <NA> EventB  en-GB       <NA> 1416417567
 values          <NA> EventM   <NA>       <NA> 1416417569

which is what I want. However, given the length of the list, and since every time the command

df1 <- merge(df1, stack(unlist(L[[i]])), by = "ind", all = TRUE)

runs it has to load the entire data frame (df1) again, the loop takes a very long time. Therefore, I was wondering if anyone knows a better/faster way to code this. In other words: given a long list of named vectors with different lengths, is there a fast way to merge them into a data frame like the one described above?

For example, is there a way of doing this using foreach and %dopar%? In any case, any faster approach is welcome.

Upvotes: 3

Views: 2715

Answers (4)

wingwe

Reputation: 41

The original post is about merging named vectors. Define the first two given in the example above as vectors:

>C1 <- c(event = "EventA", time = 1416355303,
         city = "Los Angeles", region = "California",
         Locale = "en-GB")
>C2 <- c(event = "EventB", time = 1416417567,
         Locale = "en-GB")

If you want to merge them and are OK with giving up the extra data in the longer vector, you can index the longer vector by the names in the shorter vector:

>C1 <- C1[names(C2)]

Then just use rbind or cbind. Example with rbind:

>C1_C2 <- rbind(C1,C2)
>C1_C2

   event    time         Locale 
C1 "EventA" "1416355303" "en-GB"
C2 "EventB" "1416417567" "en-GB"

You can combine the final two steps, but the first row will lose its name (C1) if you do that.
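
If dropping the extra fields is not acceptable, the same name-indexing idea can be turned around: index every vector by the union of all names, so that fields a vector lacks come back as NA. This is a minimal base R sketch rather than part of the answer above; C1 and C2 are redefined so the block runs on its own, and `all_names` and `merged` are illustrative names.

```r
# Example vectors as defined above; c() coerces the numbers to character.
C1 <- c(event = "EventA", time = 1416355303,
        city = "Los Angeles", region = "California",
        Locale = "en-GB")
C2 <- c(event = "EventB", time = 1416417567,
        Locale = "en-GB")

# Union of all names, in order of first appearance.
all_names <- union(names(C1), names(C2))

# Indexing a named vector by a name it does not have yields NA,
# so both rows end up with the full set of columns.
merged <- rbind(C1 = unname(C1[all_names]),
                C2 = unname(C2[all_names]))
colnames(merged) <- all_names
merged
```

The result is a character matrix with one row per vector; wrap it in as.data.frame() if a data frame is needed.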

Upvotes: 1

Rich Scriven

Reputation: 99331

I've heard the data.table package is pretty fast. And rbindlist is perfect for this list.

library(data.table)
rbindlist(L, fill=TRUE)
#     event       time        city     region Locale
# 1: EventA 1416355303 Los Angeles California  en-GB
# 2: EventB 1416417567          NA         NA  en-GB
# 3: EventM 1416417569          NA         NA     NA
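
For a self-contained run, here is a sketch that builds a three-element stand-in for L and converts the result to a plain data frame. The stand-in is an assumption based on the output printed in the question, with every value stored as a string. `fill = TRUE` is the argument that pads the missing fields with NA.

```r
library(data.table)

# Hypothetical stand-in for the question's list (values kept as strings).
L <- list(
  list(event = "EventA", time = "1416355303", city = "Los Angeles",
       region = "California", Locale = "en-GB"),
  list(event = "EventB", time = "1416417567", Locale = "en-GB"),
  list(event = "EventM", time = "1416417569")
)

dt <- rbindlist(L, fill = TRUE)   # one row per list element
df <- as.data.frame(dt)           # plain data.frame, if one is required
df$time <- as.numeric(df$time)    # timestamps arrive as character strings
```

The numeric conversion at the end is optional and only makes sense if the time field always holds a number.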

Upvotes: 5

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Here's a compact solution to consider:

library(reshape2)
dcast(melt(L), L1 ~ L2, value.var = "value")
#   L1        city  event Locale     region       time
# 1  1 Los Angeles EventA  en-GB California 1416355303
# 2  2        <NA> EventB  en-GB       <NA> 1416417567
# 3  3        <NA> EventM   <NA>       <NA> 1416417569

Upvotes: 2

Roland

Reputation: 132706

I'm not sure why you use merge. It seems to me like you should simply rbind.

L <- list(list(event = "EventA", time = 1416355303, 
               city = "Los Angeles", region = "California",
               Locale = "en-GB"),
          list(event = "EventB", time = 1416417567,
               Locale = "en-GB"),
          list(event = "EventM", time = 1416417569))

library(plyr)
do.call(rbind.fill, lapply(L, as.data.frame))
#   event       time        city     region Locale
#1 EventA 1416355303 Los Angeles California  en-GB
#2 EventB 1416417567        <NA>       <NA>  en-GB
#3 EventM 1416417569        <NA>       <NA>   <NA>
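
If building one data frame per element turns out to be too slow for a very long list, the same row-binding idea can be sketched in base R by filling a preallocated character matrix by name. This is an untested-at-scale sketch, not benchmarked against the approach above; `cols`, `m` and `df` are illustrative names, and the list is redefined with string values so the block runs on its own.

```r
# Stand-in for the list from the question (values kept as strings).
L <- list(
  list(event = "EventA", time = "1416355303", city = "Los Angeles",
       region = "California", Locale = "en-GB"),
  list(event = "EventB", time = "1416417567", Locale = "en-GB"),
  list(event = "EventM", time = "1416417569")
)

# All field names that occur anywhere in the list.
cols <- unique(unlist(lapply(L, names)))

# Preallocate an NA-filled character matrix: one row per list element.
m <- matrix(NA_character_, nrow = length(L), ncol = length(cols),
            dimnames = list(NULL, cols))

# Fill each row by name; fields an element lacks stay NA.
for (i in seq_along(L)) {
  x <- unlist(L[[i]])
  m[i, names(x)] <- x
}

df <- as.data.frame(m, stringsAsFactors = FALSE)
```

Every column comes out as character, so numeric fields such as time would still need an explicit as.numeric() afterwards.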

Upvotes: 2
