Using the spread function correctly

Question

My data "IDrel" has three variables, "ID", "category" and "value", it looks something like this:

    ID  category value
1       AU: Duflo, Esther
2   2   AF: MIT
3   3   SO: American Economic Journal: Applied Economics, 2(2), April 2010, pp.
4   4   IS: 1945-7782
5   5   AV: http://www.aeaweb.org/aej-applied/
6   6   DT: Journal Article
7   7   PY: 2010
8   8   AN: 1094392
9   9   TI: Prize Structure and Information in Tournaments: Experimental Evidence
10  10  AU: Freeman, Richard B.; Gelber, Alexander M.
11  11  AF: NBER; NBER
12  12  SO: American Economic Journal: Applied Economics, 2

There are 9 unique values of "category", and I want to make each of them into their own variable, with values represented by the "value"-variable. In theory, this should increase the number of variables to 11 and reduce the number of observations by something like 9/10.

The desired output looks something like this:

ID    AU:            SO.                           IS:          [more variables]

    1     Duflo, Esther   American Economic Journal    1945-7782    [more values]
    2     Freeman, Richard American Economic Journal:  [...]

I try to achieve this via the spread-function:

IDspread<-spread(IDrel, category, value)

I actually do get 11 variables, but I get the same number of observations. The results seem nonsensical to me, these are the first 10 observations:

IS:  PY:
1       
2       
3       
4 1945-7782

Am I using this function correctly? Is there an alternative?

I realize the ID variable doesn't reflect which category/value belongs with which observation, but I need to reorder the data in order to "get" the correct index for each observation ^^

akrun · Accepted Answer

The issue is the use of ID variable. It is having unique values. So, if we keep that variable while doing spread, the number of rows in the spread output would be equal to the number of unique elements in 'ID' and combinations that doesn't exist would be NA. Remove the 'ID', create a sequence column grouped by 'category' (to take care of duplicates) and then spread

library(dplyr)
library(tidyr)
IDrel %>%        
   group_by(category) %>% 
   mutate(ID = row_number()) %>% # update the ID column
   spread(category, value)
# A tibble: 2 x 10
#    ID `AF:`    `AN:`  `AU:`                `AV:`             `DT:`     `IS:`  `PY:` `SO:`                          `TI:`                         
#                                                                                                 
#1     1 MIT      10943… Duflo, Esther        http://www.aeawe… Journal … 1945-… 2010  American Economic Journal: Ap… Prize Structure and Informati…
#2     2 NBER; N…    Freeman, Richard B.…                          American Economic Journal: Ap…

data

IDrel <- structure(list(ID = 1:12, category = c("AU:", "AF:", "SO:", "IS:", 
"AV:", "DT:", "PY:", "AN:", "TI:", "AU:", "AF:", "SO:"), value = c("Duflo, Esther", 
"MIT", "American Economic Journal: Applied Economics, 2(2), April 2010, pp.", 
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article", 
"2010", "1094392", "Prize Structure and Information in Tournaments: Experimental Evidence", 
"Freeman, Richard B.; Gelber, Alexander M.", "NBER; NBER", "American Economic Journal: Applied Economics, 2"
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", 
"6", "7", "8", "9", "10", "11", "12"))

Using the spread function correctly

Answers (2)

data

Related Questions