Magnus
Magnus

Reputation: 760

How to deal with messy data - with variables not marked as missing

My data is quite messy but for most part, it looks something like this, with a repeating sequence of 9 categories:

9   TI: Prize Structure and Information in Tournaments: Experimental Evidence
10  AU: Freeman, Richard B.; Gelber, Alexander M.
11  AF: NBER; NBER
12  SO: American Economic Journal: Applied Economics, 2(1), January 2010, pp. 149-64
13  IS: 1945-7782
14  AV: http://www.aeaweb.org/aej-applied/
15  DT: Journal Article
16  PY: 2010
17  AN: 1075725

However, some of the sequences are lacking certain variables/variable numbers which are not marked as NA. An example would be the very first sequence, lacking the title (TI:)-variable:

1   AU: Duflo, Esther
2   AF: MIT
3   SO: American Economic Journal: Applied Economics, 2(2), April 2010, pp.
4   IS: 1945-7782
5   AV: http://www.aeaweb.org/aej-applied/
6   DT: Journal Article
7   PY: 2010
8   AN: 1094392

Every sequence (or so I gather) will end with Article Number (AN). Apart from this, variables/variable values seem to be missing pretty much at random.

If I use ordinary methods (assigning ID-numbers by group and then the spread function) this would change the order in my data, causing the title for the second sequence to appear in the first, and so forth. I want to make the data appear in this format, with 'NA' for each missing variable value in each sequence:

Sequence    TI:   AU:   AF:   SO:   IS:   AV:   DT:   PY:    AN:       
    1      'NA'  Duflo MIT   AEJ   1945  aea   jour  2010   1094392
    2       Priz Freem NBER  AEJ   1945  aea   jour  2010   1075725

Is there any way I can do this with normal r functions?

I don't know if this will help (this data is quite regular) but I'll attach the first 27 observations:

structure(list(category = c("AU:", "AF:", "SO:", "IS:", "AV:", 
"DT:", "PY:", "AN:", "TI:", "AU:", "AF:", "SO:", "IS:", "AV:", 
"DT:", "PY:", "AN:", "TI:", "AU:", "AF:", "SO:", "IS:", "AV:", 
"DT:", "PY:", "AN:", "TI:"), value = c("Duflo, Esther", "MIT", 
"American Economic Journal: Applied Economics, 2(2), April 2010, pp.", 
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article", 
"2010", "1094392", "Prize Structure and Information in Tournaments: Experimental Evidence", 
"Freeman, Richard B.; Gelber, Alexander M.", "NBER; NBER", "American Economic Journal: Applied Economics, 2(1), January 2010, pp. 149-64", 
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article", 
"2010", "1075725", "Why Have College Completion Rates Declined? An Analysis of Changing Student Preparation and Collegiate Resources", 
"Bound, John; Lovenheim, Michael F.; Turner, Sarah", "U MI; Cornell U; U VA", 
"American Economic Journal: Applied Economics, 2(3), July 2010, pp. 129-57", 
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article", 
"2010", "1105792", "An Empirical Analysis of the Gender Gap in Mathematics"
)), row.names = c(NA, 27L), class = "data.frame")

I will also ad a link to the file I'm currently working with:

https://www.dropbox.com/s/wwaimr21eld2jg6/relevant.csv?dl=0

Please note that one of the values for "category" reads "AN: Perspectives from a Cluster Analysis[...]" and that this value is incorrect. I have not yet managed to grep/delete this entry and I'm just too tired right now, sorry about that.

Upvotes: 1

Views: 77

Answers (2)

Shree
Shree

Reputation: 11140

Based on "Every sequence (or so I gather) will end with Article Number (AN). Apart from this, variables/variable values seem to be missing pretty much at random" I am assuming that "AN:" is never missing -

df %>% 
  mutate(
    ID = cumsum(category == "AN:") - (category == "AN:") + 1
  ) %>% 
  spread(category, value)

# A tibble: 4 x 10
     ID `AF:`     `AN:`  `AU:`            `AV:`        `DT:`    `IS:` `PY:` `SO:`                 `TI:`                        
  <dbl> <chr>     <chr>  <chr>            <chr>        <chr>    <chr> <chr> <chr>                 <chr>                        
1     1 MIT       10943~ Duflo, Esther    http://www.~ Journal~ 1945~ 2010  American Economic Jo~ <NA>                         
2     2 NBER; NB~ 10757~ Freeman, Richar~ http://www.~ Journal~ 1945~ 2010  American Economic Jo~ Prize Structure and Informat~
3     3 U MI; Co~ 11057~ Bound, John; Lo~ http://www.~ Journal~ 1945~ 2010  American Economic Jo~ Why Have College Completion ~
4     4 <NA>      <NA>   <NA>             <NA>         <NA>     <NA>  <NA>  <NA>                  An Empirical Analysis of the~

Note -

Here's a way to test if "AN:" is a good indicator for end of sequence - Below code counts the rows between consecutive "AN:". If all intervals are <= 9 then it returns FALSE which means that "AN:" is a good indicator for end of sequences. If not, then I'd suggest hosting the actual file somewhere so that people can look at the whole dataset.

df %>% 
  group_split(ID = cumsum(category == "AN:") - (category == "AN:") + 1) %>% 
  sapply(nrow) %>% 
  {any(. > 9)}

[1] FALSE

Upvotes: 3

David Klotz
David Klotz

Reputation: 2431

Suppose you call that dataset dt:

library(tidyverse)
dt %>% mutate(Sequence = cumsum(row_number() == 1 | lag(category) == 'AN:')) %>% 
    group_by(Sequence) %>% 
    mutate(seq_ind = row_number(), n = n()) %>% 
    ungroup %>% 
    arrange(desc(n), seq_ind) %>% #This will put sequences with the most number of fields first
    mutate(category = as_factor(category)) %>% #column orders are defined based on the order of the largest sequence
    select(-seq_ind, -n) %>% 
    spread(key = category, value = value)

Upvotes: 1

Related Questions