Reputation: 1049
I am using tidyr
from R
and am running into an issue when using the spread()
command with duplicate identifiers.
Here is a mock example that illustrates the problem:
X = data.frame(name=c("Eric","Bob","Mark","Bob","Bob","Mark","Eric","Bob","Mark"),
metric=c("height","height","height","weight","weight","weight","grade","grade","grade"),
values=c(6,5,4,120,118,180,"A","B","C"),
stringsAsFactors=FALSE)
tidyr::spread(X,metric,values)
So when I run this command I get the following error:
Error: Duplicate identifiers for rows (4, 5)
which makes sense why its an error because Bob is recorded twice for weight. It's actually nota mistake because Bob did have his weight recorded twice. What I would like to be able to do is have run the command and have it it give me back the following:
name height weight grade
Eric 6 NA A
Bob 5 120 B
Bob 5 118 B
Mark 4 180 C
Is spread not the command I should be using to accomplish this? And if there isn't an easy solution is there a simple way to remove the record with lowest weight for duplicates when running the spread()
command?
Upvotes: 2
Views: 89
Reputation: 36076
After making unique identifiers, which can be done by making a new variable representing the index within each group, you can use fill
to fill the second "Bob" row with a duplicate value for "height" and "grade".
You can remove the index variable at the end via select
.
library(dplyr)
library(tidyr)
X %>%
group_by(name, metric) %>%
mutate(row = row_number() ) %>%
spread(metric, values) %>%
fill(grade, height) %>%
select(-row)
# A tibble: 4 x 4
# Groups: name [3]
name grade height weight
<chr> <chr> <chr> <chr>
1 Bob B 5 120
2 Bob B 5 118
3 Eric A 6 <NA>
4 Mark C 4 180
To filter
to the maximum value of each name
/metric
group:
X %>%
group_by(name, metric) %>%
filter(values == max(values)) %>%
spread(metric, values)
# A tibble: 3 x 4
# Groups: name [3]
name grade height weight
* <chr> <chr> <chr> <chr>
1 Bob B 5 120
2 Eric A 6 <NA>
3 Mark C 4 180
Upvotes: 2