Reputation: 6147
Input data:
> head(iris[c(48:50, 98:100), 3:5])
Petal.Length Petal.Width Species
48 1.4 0.2 setosa
49 1.5 0.2 setosa
50 1.4 0.2 setosa
98 4.3 1.3 versicolor
99 3.0 1.1 versicolor
100 4.1 1.3 versicolor
Output data:
setosa.Petal.Length versicolor.Petal.Length setosa.Petal.Width versicolor.Petal.Width
1.4 4.3 0.2 1.3
1.5 3.0 0.2 1.1
1.4 4.1 0.2 1.3
using, for example:
spread(
  iris %>% mutate(n = row_number()),
  key = Species,
  value = Petal.Length:Petal.Width
)
# or value = c("Petal.Length", "Petal.Width")
doesn't work.
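(For what it's worth, a single value column does spread correctly, so the problem seems to be specifically that spread() accepts only one value column:)

```r
library(dplyr)
library(tidyr)

# A single value column spreads fine; the failure above comes only from
# passing several columns to `value`
res <- iris %>%
  select(Petal.Length, Species) %>%
  group_by(Species) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  spread(key = Species, value = Petal.Length)
```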
Upvotes: 2
Views: 1503
Reputation: 2722
Came across this and was curious about the mention of "large datasets". As an avid fan of everything tidyverse, I am often benchmarking performance, since I work with large sets constantly. For this example, this is what I came up with, and the results were alarmingly in favor of base package functions over piping with tidyverse calls.
Replicating the iris data set to fake a larger data set:
iris_1000 <- plyr::ldply(1:1000, function(i){
  iris
})
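(The same replication works without plyr, if preferred; a base-R equivalent:)

```r
# Row-bind 1000 copies of iris using only base R
iris_1000 <- do.call(rbind, replicate(1000, iris, simplify = FALSE))
```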
Dimensions are 150,000 rows by 5 columns:
dim(iris_1000)
[1] 150000 5
So this would represent a dataset with a size of:
sprintf("%s MB", object.size(iris_1000) * 0.001^2)
[1] "5.401688 MB"
Wrapping the provided answer in a function call for testing, but changing it so that it applies to all columns (not just those containing 'Petal'), dropping the index column, and returning a data frame:
tidy_fn <- function(){
  iris_1000 %>%
    gather(key, value, -Species) %>%
    unite(tmp, Species, key, sep = ".") %>%
    group_by(tmp) %>%
    mutate(indx = row_number()) %>%
    spread(tmp, value) %>%
    select(-indx) %>%
    data.frame
}
A function using mostly base package functions, taking as inputs a data frame ('df') and the name of the column that the data is to be aggregated by, or in our case split on ('col_split'):
baseish_fn <- function(df, col_split){
  df_split <- split(df, df[[col_split]])
  df_loop <- lapply(names(df_split), function(i){
    iter_df <- df_split[[i]][-which(colnames(df_split[[i]]) == col_split)]
    new_names <- sprintf("%s.%s", i, colnames(iter_df))
    colnames(iter_df) <- new_names
    iter_df
  })
  names(df_loop) <- NULL
  as.data.frame(df_loop)
}
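The core idea of the function above — split() by the grouping column, prefix each group's column names, then column-bind the pieces with as.data.frame() — can be seen on the question's Petal columns in a minimal self-contained sketch:

```r
# Minimal sketch of the split / rename / bind idea on iris's Petal columns
petal <- iris[c("Petal.Length", "Petal.Width", "Species")]
parts <- split(petal[1:2], petal$Species)   # one data frame per species
parts <- lapply(names(parts), function(sp) {
  setNames(parts[[sp]], paste(sp, names(parts[[sp]]), sep = "."))
})
wide <- as.data.frame(parts)                # column-bind the pieces
head(wide, 3)
```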
Now, the speed comparisons were surprising. One note: the columns of the base-ish function's result are rearranged to match the column order returned by the tidy function, for the later comparison of values, so run the tidy version once to get that ordering:
tidy_val_test <- tidy_fn()
microbenchmark::microbenchmark(
  tidyverse = tidy_fn(),
  `base-ish` = baseish_fn(iris_1000, "Species")[colnames(tidy_val_test)],
  times = 100L
)
Unit: milliseconds
expr min lq mean median uq max neval
tidyverse 378.46753 419.93116 447.8020 438.5375 466.78367 718.9666 100
base-ish 76.57918 83.21915 92.8109 87.8329 94.58085 342.2290 100
And to ensure they produce identical outputs (storing the base-ish result first, since `tidyverse` and `base-ish` above are only benchmark labels, not objects):
base_val_test <- baseish_fn(iris_1000, "Species")
all(mapply(function(i){
  identical(tidy_val_test[[i]], base_val_test[[i]])
}, colnames(tidy_val_test)))
[1] TRUE
head(base_val_test[1:4, 1:5])
setosa.Sepal.Length setosa.Sepal.Width setosa.Petal.Length setosa.Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
versicolor.Sepal.Length
1 7.0
2 6.4
3 6.9
4 5.5
Upvotes: 3
Reputation: 1074
I was writing my answer when I saw David Arenburg's great comment, and I adapted my answer to use it.
I got this:
df <- iris %>%
  select(c(starts_with("Petal"), Species)) %>%
  gather(key, value, -Species) %>%
  unite(tmp, Species, key, sep = ".") %>%
  group_by(tmp) %>%
  mutate(indx = row_number()) %>%
  spread(tmp, value)
I didn't know the unite() function; that was the main difference.
Note: since I can't comment, I needed to post this as an answer; I can delete it if you feel I'm stealing your answer.
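For reference, in newer tidyr (>= 1.0.0) gather()/spread() are superseded by pivot_longer()/pivot_wider(); a sketch of the same reshape with pivot_wider(), assuming that tidyr version:

```r
library(dplyr)
library(tidyr)

# Same reshape via pivot_wider(): one row index per species, then widen,
# using names_glue to get "species.column" names as in the question
out <- iris %>%
  select(starts_with("Petal"), Species) %>%
  group_by(Species) %>%
  mutate(indx = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols     = indx,
    names_from  = Species,
    values_from = starts_with("Petal"),
    names_glue  = "{Species}.{.value}"
  ) %>%
  select(-indx)
```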
Upvotes: 0