Reputation: 381
The documentation for tidyr suggests that gather and spread are transitive, but the following example with the "iris" data shows they are not, but it is not clear why. Any clarification would be greatly appreciated
iris.df = as.data.frame(iris)
long.iris.df = iris.df %>% gather(key = feature.measure, value = size, -Species)
w.iris.df = long.iris.df %>% spread(key = feature.measure, value = size, -Species)
I expected the data frame "w.iris.df" to be the same as "iris.df" but received the following error instead:
"Error: Duplicate identifiers for rows (1, 2, 3, 4, 5, 6, 7, 8, 9..."
My general question is how to reverse an application of "gather" on this sort of dataset.
Upvotes: 38
Views: 14477
Reputation: 3045
As the previous answer may not have been sufficiently clear, there's a problem with how you are performing gather
that shows itself when you are trying to spread
.
The issue is that in the process of gathering you are losing track of which feature.measure
belong to which row in the original dataframe, so spread
has no idea how to combine individual values into the the "wide" table again.
iris.df = as.data.frame(iris)
long.iris.df = iris.df %>%
tibble::rowid_to_column() %>%
gather(key = feature.measure, value = size, -Species, -rowid)
#> rowid Species feature.measure size
#> 1 1 setosa Sepal.Length 5.1
#> 2 2 setosa Sepal.Length 4.9
#> 3 3 setosa Sepal.Length 4.7
#> 4 4 setosa Sepal.Length 4.6
#> 5 5 setosa Sepal.Length 5.0
#> 6 6 setosa Sepal.Length 5.4
Now each value in size
preserves its rowid
so you will always be able to recombine it back to the wide dataset (removing unnecessary rowid
):
w.iris.df = long.iris.df %>%
spread(key = feature.measure, value = size) %>%
select(-rowid)
head(w.iris.df)
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width
#> 1 setosa 1.4 0.2 5.1 3.5
#> 2 setosa 1.4 0.2 4.9 3.0
#> 3 setosa 1.3 0.2 4.7 3.2
#> 4 setosa 1.5 0.2 4.6 3.1
#> 5 setosa 1.4 0.2 5.0 3.6
#> 6 setosa 1.7 0.4 5.4 3.9
Upvotes: 0
Reputation: 2950
Hadley's intervention was unsurprisingly perfect... but I ended up mucking with the syntax a bit after that... so for what it's worth, I post the fully operational code (sorry my syntax is a bit different than above):
library(tidyr)
library(dplyr)
wide <-
iris %>%
mutate(row = row_number()) %>%
gather(vars, val, -Species, -row) %>%
spread(vars, val)
head(wide)
# Species row Petal.Length Petal.Width Sepal.Length Sepal.Width
# 1 setosa 1 1.4 0.2 5.1 3.5
# 2 setosa 2 1.4 0.2 4.9 3.0
# 3 setosa 3 1.3 0.2 4.7 3.2
# 4 setosa 4 1.5 0.2 4.6 3.1
# 5 setosa 5 1.4 0.2 5.0 3.6
# 6 setosa 6 1.7 0.4 5.4 3.9
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
They are the same.... just need to reorder if u feel like it...
wide <- wide[,c(3, 4, 5, 6, 1)] ## Reorder and then remove "row" column
and done.
Upvotes: 29