Pancho Mulongeni

Reputation: 1

Using spread with duplicate identifiers gives sparse matrix with NAs

A user raised this on GitHub (https://github.com/tidyverse/tidyr/issues/41), and I see that Hadley identified it as a bug. However, no solution was given. I still run into this problem when my data frame has duplicate identifiers:

dftest <- structure(list(key = c("a", "b", "c", "d", "c"), value = c(1, 
2, 3, 2, 4)), .Names = c("key", "value"), row.names = c(NA, -5L
), class = c("tbl_df", "tbl", "data.frame"))

Now when I use spread from tidyr, I get an error, because I happen to have duplicate identifiers:

dftest %>% spread(key,value)
Error: Duplicate identifiers for rows (3, 5)

So I add an ID column, which gives me a sparse matrix with NAs:

> dftest$id<-seq(1,5)
> dftest %>% spread(key,value)
# A tibble: 5 x 5
     id     a     b     c     d
  <int> <dbl> <dbl> <dbl> <dbl>
1     1    1.   NA    NA    NA 
2     2   NA     2.   NA    NA 
3     3   NA    NA     3.   NA 
4     4   NA    NA    NA     2.
5     5   NA    NA     4.   NA 

But this diagonal data frame is not what I want. I would like the first row of the spread output to read 1, 2, 3, 2, with the second value for column c falling right underneath it, in row 2. That is to say, I have no use for a diagonal matrix with NAs. Am I missing something? I ask with humility.

Upvotes: 0

Views: 392

Answers (1)

Nik Muhammad Naim

Reputation: 578

You're so close to getting the right output.

Using dftest from your original input.

Method:

dftest %>% group_by(key) %>% mutate(id = 1:length(key)) %>% spread(key, value)

Output:

# A tibble: 2 x 5
     id     a     b     c     d
  <int> <dbl> <dbl> <dbl> <dbl>
1     1    1.    2.    3.    2.
2     2   NA    NA     4.   NA
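Why this works: numbering the rows within each key (instead of across the whole data frame) means the first occurrence of every key shares id 1, so those values land in the same output row, and only the repeated key gets id 2. As a rough sketch of the same idea in newer tidyr (1.0.0 and later), where spread is superseded by pivot_wider, something like the following should give the same shape. The name wide is just an illustrative variable, and I am assuming dftest is rebuilt as a plain tibble from the values in the question.

```r
library(dplyr)
library(tidyr)

# Rebuild the question's data as a tibble (same values as the dput output).
dftest <- tibble(
  key   = c("a", "b", "c", "d", "c"),
  value = c(1, 2, 3, 2, 4)
)

wide <- dftest %>%
  group_by(key) %>%
  mutate(id = row_number()) %>%   # per-key counter: 1 for first occurrence, 2 for second
  ungroup() %>%
  pivot_wider(names_from = key, values_from = value)

print(wide)
```

Here row_number() plays the same role as 1:length(key) in the spread version; the ungroup() call is only there so the grouping does not carry over into later steps.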

Upvotes: 2
