Reputation: 760
My data "IDrel" has three variables, "ID", "category" and "value", it looks something like this:
ID category value
1 AU: Duflo, Esther
2 2 AF: MIT
3 3 SO: American Economic Journal: Applied Economics, 2(2), April 2010, pp.
4 4 IS: 1945-7782
5 5 AV: http://www.aeaweb.org/aej-applied/
6 6 DT: Journal Article
7 7 PY: 2010
8 8 AN: 1094392
9 9 TI: Prize Structure and Information in Tournaments: Experimental Evidence
10 10 AU: Freeman, Richard B.; Gelber, Alexander M.
11 11 AF: NBER; NBER
12 12 SO: American Economic Journal: Applied Economics, 2
There are 9 unique values of "category", and I want to make each of them into their own variable, with values represented by the "value"-variable. In theory, this should increase the number of variables to 11 and reduce the number of observations by something like 9/10.
The desired output looks something like this:
ID AU: SO. IS: [more variables]
1 Duflo, Esther American Economic Journal 1945-7782 [more values]
2 Freeman, Richard American Economic Journal: [...]
I try to achieve this via the spread-function:
IDspread<-spread(IDrel, category, value)
I actually do get 11 variables, but I get the same number of observations. The results seem nonsensical to me, these are the first 10 observations:
IS: PY:
1 <NA> <NA>
2 <NA> <NA>
3 <NA> <NA>
4 1945-7782 <NA>
Am I using this function correctly? Is there an alternative?
I realize the ID variable doesn't reflect which category/value belongs with which observation, but I need to reorder the data in order to "get" the correct index for each observation ^^
Upvotes: 1
Views: 61
Reputation: 42544
This not a direct answer to the question on spread()
but the data looks pretty damn similar to DCF
format.
DCF
is a simple format for storing databases in plain text files and can be read directly from file using base R's read.dcf()
function.
So, if the data would be stored in file IDrel.dcf
AU: Duflo, Esther AF: MIT SO: American Economic Journal: Applied Economics, 2(2), April 2010, pp. IS: 1945-7782 AV: http://www.aeaweb.org/aej-applied/ DT: Journal Article PY: 2010 AN: 1094392 TI: Prize Structure and Information in Tournaments: Experimental Evidence AU: Freeman, Richard B.; Gelber, Alexander M. AF: NBER; NBER SO: American Economic Journal: Applied Economics, 2
the command
read.dcf("IDrel.dcf", all = TRUE)
would return the data.frame
AU AF SO IS 1 Duflo, Esther MIT American Economic Journal: Applied Economics, 2(2), April 2010, pp. 1945-7782 2 Freeman, Richard B.; Gelber, Alexander M. NBER; NBER American Economic Journal: Applied Economics, 2 <NA> AV DT PY AN TI 1 http://www.aeaweb.org/aej-applied/ Journal Article 2010 1094392 Prize Structure and Information in Tournaments: Experimental Evidence 2 <NA> <NA> <NA> <NA>
Note that the empty line indicates the beginning of a new record.
The DCF
rules as implemented in R are explained in the help file ?read.dcf
.
Upvotes: 1
Reputation: 886948
The issue is the use of ID
variable. It is having unique values. So, if we keep that variable while doing spread
, the number of rows in the spread
output would be equal to the number of unique elements in 'ID' and combinations that doesn't exist would be NA
. Remove the 'ID', create a sequence column grouped by 'category' (to take care of duplicates) and then spread
library(dplyr)
library(tidyr)
IDrel %>%
group_by(category) %>%
mutate(ID = row_number()) %>% # update the ID column
spread(category, value)
# A tibble: 2 x 10
# ID `AF:` `AN:` `AU:` `AV:` `DT:` `IS:` `PY:` `SO:` `TI:`
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 1 MIT 10943… Duflo, Esther http://www.aeawe… Journal … 1945-… 2010 American Economic Journal: Ap… Prize Structure and Informati…
#2 2 NBER; N… <NA> Freeman, Richard B.… <NA> <NA> <NA> <NA> American Economic Journal: Ap… <NA>
IDrel <- structure(list(ID = 1:12, category = c("AU:", "AF:", "SO:", "IS:",
"AV:", "DT:", "PY:", "AN:", "TI:", "AU:", "AF:", "SO:"), value = c("Duflo, Esther",
"MIT", "American Economic Journal: Applied Economics, 2(2), April 2010, pp.",
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article",
"2010", "1094392", "Prize Structure and Information in Tournaments: Experimental Evidence",
"Freeman, Richard B.; Gelber, Alexander M.", "NBER; NBER", "American Economic Journal: Applied Economics, 2"
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10", "11", "12"))
Upvotes: 2