Reputation: 760
My data is quite messy but for most part, it looks something like this, with a repeating sequence of 9 categories:
9 TI: Prize Structure and Information in Tournaments: Experimental Evidence
10 AU: Freeman, Richard B.; Gelber, Alexander M.
11 AF: NBER; NBER
12 SO: American Economic Journal: Applied Economics, 2(1), January 2010, pp. 149-64
13 IS: 1945-7782
14 AV: http://www.aeaweb.org/aej-applied/
15 DT: Journal Article
16 PY: 2010
17 AN: 1075725
However, some of the sequences are lacking certain variables/variable numbers which are not marked as NA. An example would be the very first sequence, lacking the title (TI:)-variable:
1 AU: Duflo, Esther
2 AF: MIT
3 SO: American Economic Journal: Applied Economics, 2(2), April 2010, pp.
4 IS: 1945-7782
5 AV: http://www.aeaweb.org/aej-applied/
6 DT: Journal Article
7 PY: 2010
8 AN: 1094392
Every sequence (or so I gather) will end with Article Number (AN). Apart from this, variables/variable values seem to be missing pretty much at random.
If I use ordinary methods (assigning ID-numbers by group and then the spread function) this would change the order in my data, causing the title for the second sequence to appear in the first, and so forth. I want to make the data appear in this format, with 'NA' for each missing variable value in each sequence:
Sequence TI: AU: AF: SO: IS: AV: DT: PY: AN:
1 'NA' Duflo MIT AEJ 1945 aea jour 2010 1094392
2 Priz Freem NBER AEJ 1945 aea jour 2010 1075725
Is there any way I can do this with normal r functions?
I don't know if this will help (this data is quite regular) but I'll attach the first 27 observations:
structure(list(category = c("AU:", "AF:", "SO:", "IS:", "AV:",
"DT:", "PY:", "AN:", "TI:", "AU:", "AF:", "SO:", "IS:", "AV:",
"DT:", "PY:", "AN:", "TI:", "AU:", "AF:", "SO:", "IS:", "AV:",
"DT:", "PY:", "AN:", "TI:"), value = c("Duflo, Esther", "MIT",
"American Economic Journal: Applied Economics, 2(2), April 2010, pp.",
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article",
"2010", "1094392", "Prize Structure and Information in Tournaments: Experimental Evidence",
"Freeman, Richard B.; Gelber, Alexander M.", "NBER; NBER", "American Economic Journal: Applied Economics, 2(1), January 2010, pp. 149-64",
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article",
"2010", "1075725", "Why Have College Completion Rates Declined? An Analysis of Changing Student Preparation and Collegiate Resources",
"Bound, John; Lovenheim, Michael F.; Turner, Sarah", "U MI; Cornell U; U VA",
"American Economic Journal: Applied Economics, 2(3), July 2010, pp. 129-57",
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article",
"2010", "1105792", "An Empirical Analysis of the Gender Gap in Mathematics"
)), row.names = c(NA, 27L), class = "data.frame")
I will also ad a link to the file I'm currently working with:
https://www.dropbox.com/s/wwaimr21eld2jg6/relevant.csv?dl=0
Please note that one of the values for "category" reads "AN: Perspectives from a Cluster Analysis[...]" and that this value is incorrect. I have not yet managed to grep/delete this entry and I'm just too tired right now, sorry about that.
Upvotes: 1
Views: 77
Reputation: 11140
Based on "Every sequence (or so I gather) will end with Article Number (AN). Apart from this, variables/variable values seem to be missing pretty much at random" I am assuming that "AN:"
is never missing -
df %>%
mutate(
ID = cumsum(category == "AN:") - (category == "AN:") + 1
) %>%
spread(category, value)
# A tibble: 4 x 10
ID `AF:` `AN:` `AU:` `AV:` `DT:` `IS:` `PY:` `SO:` `TI:`
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 MIT 10943~ Duflo, Esther http://www.~ Journal~ 1945~ 2010 American Economic Jo~ <NA>
2 2 NBER; NB~ 10757~ Freeman, Richar~ http://www.~ Journal~ 1945~ 2010 American Economic Jo~ Prize Structure and Informat~
3 3 U MI; Co~ 11057~ Bound, John; Lo~ http://www.~ Journal~ 1945~ 2010 American Economic Jo~ Why Have College Completion ~
4 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> An Empirical Analysis of the~
Here's a way to test if "AN:" is a good indicator for end of sequence - Below code counts the rows between consecutive "AN:". If all intervals are <= 9
then it returns FALSE
which means that "AN:" is a good indicator for end of sequences. If not, then I'd suggest hosting the actual file somewhere so that people can look at the whole dataset.
df %>%
group_split(ID = cumsum(category == "AN:") - (category == "AN:") + 1) %>%
sapply(nrow) %>%
{any(. > 9)}
[1] FALSE
Upvotes: 3
Reputation: 2431
Suppose you call that dataset dt
:
library(tidyverse)
dt %>% mutate(Sequence = cumsum(row_number() == 1 | lag(category) == 'AN:')) %>%
group_by(Sequence) %>%
mutate(seq_ind = row_number(), n = n()) %>%
ungroup %>%
arrange(desc(n), seq_ind) %>% #This will put sequences with the most number of fields first
mutate(category = as_factor(category)) %>% #column orders are defined based on the order of the largest sequence
select(-seq_ind, -n) %>%
spread(key = category, value = value)
Upvotes: 1