Reputation: 6147

Spread multiple columns with values by one columne containing key

Input data :

> head(iris[c(48:50, 98:100), 3:5])
    Petal.Length Petal.Width    Species
48           1.4         0.2     setosa
49           1.5         0.2     setosa
50           1.4         0.2     setosa
98           4.3         1.3 versicolor
99           3.0         1.1 versicolor
100          4.1         1.3 versicolor

Output data :

setosa.Petal.Length versicolor.Petal.Length setosa.Petal.Width versicolor.Petal.Width 
1.4                 4.3                     0.2                1.3
1.5                 3.0                     0.2                1.1
1.4                 4.1                     0.2                1.3

using for example :

spread(
  iris%>%mutate(n=row_number()), 
  key=Species, 
  value=Petal.Length:Petal.Width) 
  #or c("Petal.Length", "Petal.Width")`

doesn't work

Upvotes: 2

Answers (2)

Carl Boneri

Reputation: 2722

Came across this and was curious about the mention on "large datasets". As an avid fan of everything tidyverse I am oftern benchmarking performace as I work with large sets constantly. For this example this is what I came up with, which was alarmingly in the favor of using base package functions over piping with tidyverse calls.

For these purposes using a split and base functions is alarmingly faster than piping. Anyone have any feedback on this ???

Replicating the iris data set to fake a larger data set.

iris_1000 <- plyr::ldply(1:1000, function(i){
  iris
})

Dimensions are 150,000 rows by 5 columns:

dim(iris_1000)
[1] 150000      5

So this would represent a dataset with a size of:

sprintf("%s MB", object.size(iris_1000) * 0.001^2)
[1] "5.401688 MB"

Wrapping the provided anser in a function call for testing but changing so that it applies to all columns, not just those containing 'Petal' and also dropping the index column and returning as a data frame

tidy_fn <- function(){
  iris_1000 %>% 
    gather(key, value, -Species) %>%
    unite(tmp, Species, key, sep=".") %>%
    group_by(tmp) %>%
    mutate(indx = row_number()) %>%
    spread(tmp, value) %>% 
    select(-indx) %>% data.frame
}

Function using mostly base package provided functions and taking an input of data frame as the 'df' variable and the column name that the data is to be aggregated by, or in our case split on

baseish_fn <- function(df, col_split){

  df_split <- split(df, df[[col_split]])

  df_loop <- lapply(names(df_split), function(i){
    iter_df <- df_split[[i]][-which(colnames(df_split[[i]]) == col_split)]
    new_names <- sprintf("%s.%s", i, colnames(iter_df))
    colnames(iter_df) <- new_names
    iter_df
  })
  names(df_loop) <- NULL
  as.data.frame(df_loop)
}

Now the speed comparisons were surprising... One note is that the column names in the baseish function will be arranged to match the returned col arrangement of the tudy function for later comparison of values so run once to get that

tidy_val_test <- tidy_fn()

microbenchmark::microbenchmark(
  tidyverse = tidy_fn(),
  `base-ish` = baseish_fn(iris_1000, "Species")[colnames(tidy_val_test)],times = 100L
)

Unit: milliseconds                                                       
      expr       min        lq     mean   median        uq      max neval
 tidyverse 378.46753 419.93116 447.8020 438.5375 466.78367 718.9666   100
  base-ish  76.57918  83.21915  92.8109  87.8329  94.58085 342.2290   100

And to ensure they produced identical outputs:

all(mapply(function(i){
    identical(tidyverse[[i]], `base-ish`[[i]])
},colnames(tidyverse)))

[1] TRUE


head(`base-ish`[1:4,1:5])
  setosa.Sepal.Length setosa.Sepal.Width setosa.Petal.Length setosa.Petal.Width
1                 5.1                3.5                 1.4                0.2
2                 4.9                3.0                 1.4                0.2
3                 4.7                3.2                 1.3                0.2
4                 4.6                3.1                 1.5                0.2
  versicolor.Sepal.Length
1                     7.0
2                     6.4
3                     6.9
4                     5.5

Upvotes: 3

cirofdo

Reputation: 1074

I was writing my answer until I saw David Arenburg great comment, and I adapted my answer with his.

I got this:

df <- iris %>%
  select(c(starts_with("Petal"), Species)) %>%
  gather(key, value, -Species) %>%
  unite(tmp, Species, key, sep=".") %>%
  group_by(tmp) %>%
  mutate(indx = row_number()) %>%
  spread(tmp, value)

I didn't know the unite() function, that was the main difference.

Obs: Since I can't comment I needed to post as an answer, and I can erase this if you feel like I'm stealing your answer.

Upvotes: 0

Spread multiple columns with values by one columne containing key

Answers (2)

For these purposes using a split and base functions is alarmingly faster than piping. Anyone have any feedback on this ???

Related Questions