Magnus
Magnus

Reputation: 760

Using the spread function correctly

My data "IDrel" has three variables, "ID", "category" and "value", it looks something like this:

    ID  category value
1       AU: Duflo, Esther
2   2   AF: MIT
3   3   SO: American Economic Journal: Applied Economics, 2(2), April 2010, pp.
4   4   IS: 1945-7782
5   5   AV: http://www.aeaweb.org/aej-applied/
6   6   DT: Journal Article
7   7   PY: 2010
8   8   AN: 1094392
9   9   TI: Prize Structure and Information in Tournaments: Experimental Evidence
10  10  AU: Freeman, Richard B.; Gelber, Alexander M.
11  11  AF: NBER; NBER
12  12  SO: American Economic Journal: Applied Economics, 2

There are 9 unique values of "category", and I want to make each of them into their own variable, with values represented by the "value"-variable. In theory, this should increase the number of variables to 11 and reduce the number of observations by something like 9/10.

The desired output looks something like this:

ID    AU:            SO.                           IS:          [more variables]

    1     Duflo, Esther   American Economic Journal    1945-7782    [more values]
    2     Freeman, Richard American Economic Journal:  [...]     

I try to achieve this via the spread-function:

IDspread<-spread(IDrel, category, value)

I actually do get 11 variables, but I get the same number of observations. The results seem nonsensical to me, these are the first 10 observations:

IS:  PY:
1      <NA> <NA>
2      <NA> <NA>
3      <NA> <NA>
4 1945-7782 <NA>

Am I using this function correctly? Is there an alternative?

I realize the ID variable doesn't reflect which category/value belongs with which observation, but I need to reorder the data in order to "get" the correct index for each observation ^^

Upvotes: 1

Views: 61

Answers (2)

Uwe
Uwe

Reputation: 42544

This not a direct answer to the question on spread() but the data looks pretty damn similar to DCF format.

DCF is a simple format for storing databases in plain text files and can be read directly from file using base R's read.dcf() function.

So, if the data would be stored in file IDrel.dcf

AU: Duflo, Esther
AF: MIT
SO: American Economic Journal: Applied Economics, 2(2), April 2010, pp.
IS: 1945-7782
AV: http://www.aeaweb.org/aej-applied/
DT: Journal Article
PY: 2010
AN: 1094392
TI: Prize Structure and Information in Tournaments: Experimental Evidence

AU: Freeman, Richard B.; Gelber, Alexander M.
AF: NBER; NBER
SO: American Economic Journal: Applied Economics, 2

the command

read.dcf("IDrel.dcf", all = TRUE)

would return the data.frame

                                         AU         AF                                                                  SO        IS
1                             Duflo, Esther        MIT American Economic Journal: Applied Economics, 2(2), April 2010, pp. 1945-7782
2 Freeman, Richard B.; Gelber, Alexander M. NBER; NBER                     American Economic Journal: Applied Economics, 2      <NA>
                                  AV              DT   PY      AN                                                                    TI
1 http://www.aeaweb.org/aej-applied/ Journal Article 2010 1094392 Prize Structure and Information in Tournaments: Experimental Evidence
2                               <NA>            <NA> <NA>    <NA>

Note that the empty line indicates the beginning of a new record.

The DCF rules as implemented in R are explained in the help file ?read.dcf.

Upvotes: 1

akrun
akrun

Reputation: 886948

The issue is the use of ID variable. It is having unique values. So, if we keep that variable while doing spread, the number of rows in the spread output would be equal to the number of unique elements in 'ID' and combinations that doesn't exist would be NA. Remove the 'ID', create a sequence column grouped by 'category' (to take care of duplicates) and then spread

library(dplyr)
library(tidyr)
IDrel %>%        
   group_by(category) %>% 
   mutate(ID = row_number()) %>% # update the ID column
   spread(category, value)
# A tibble: 2 x 10
#    ID `AF:`    `AN:`  `AU:`                `AV:`             `DT:`     `IS:`  `PY:` `SO:`                          `TI:`                         
#  <int> <chr>    <chr>  <chr>                <chr>             <chr>     <chr>  <chr> <chr>                          <chr>                         
#1     1 MIT      10943… Duflo, Esther        http://www.aeawe… Journal … 1945-… 2010  American Economic Journal: Ap… Prize Structure and Informati…
#2     2 NBER; N… <NA>   Freeman, Richard B.… <NA>              <NA>      <NA>   <NA>  American Economic Journal: Ap… <NA>   

data

IDrel <- structure(list(ID = 1:12, category = c("AU:", "AF:", "SO:", "IS:", 
"AV:", "DT:", "PY:", "AN:", "TI:", "AU:", "AF:", "SO:"), value = c("Duflo, Esther", 
"MIT", "American Economic Journal: Applied Economics, 2(2), April 2010, pp.", 
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article", 
"2010", "1094392", "Prize Structure and Information in Tournaments: Experimental Evidence", 
"Freeman, Richard B.; Gelber, Alexander M.", "NBER; NBER", "American Economic Journal: Applied Economics, 2"
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", 
"6", "7", "8", "9", "10", "11", "12"))

Upvotes: 2

Related Questions