Reputation: 760
I'm working with data that looks something like this:
AF: AU:
1 MIT Duflo, Esther
2 NBER; NBER Freeman, Richard B.; Gelber, Alexander M.
3 U MI; Cornell U; U VA Bound, John; Lovenheim, Michael F.; Turner, Sarah
4 Harvard U; U Chicago Fryer, Roland G., Jr.; Levitt, Steven D.
5 U OR; U CA, Davis; U British Columbia Lindo, Jason M.; Sanders, Nicholas J.; Oreopoulos, Philip
I have two variables, AF: for affiliation and AU: for authors. Different authors and affiliations are separated with semicolon, I want to use the separate_rows-command and create somthing like this:
AF: AU:
MIT Duflo, Esther
NBER Freeman, Richard B.
NBER Gelber, Alexander M.
U MI Bound, John
Cornell U Lovenheim, Michael F.
U VA Turner, Sarah
Harvard U; Fryer, Roland G., Jr.
U Chicago Levitt, Steven D.
U OR Lindo, Jason M.
U CA, Davis Sanders, Nicholas J.
U British ColumbiaOreopoulos, Philip
The standard version of separate_rows() generates an error message, probably since my data contains NAs:
authaf_spread<-separate_rows(authaf, 1:2, sep=";")
Error: All nested columns must have the same number of elements.
I downloaded and installed the develpment version, which just gives me another error message:
authaf_spread<-separate_rows(authaf, 1:2, sep=";")
Error: No common size for `AF:`, size 3, and `AU:`, size 4.
Call `rlang::last_error()` to see a backtrace
What does this mean and how do I circumvent this error?
If anyone's interested I'm attaching a link to the entire file:
https://www.dropbox.com/s/z456w7ll7v7o79z/authors_affiliations.csv?dl=0
Upvotes: 0
Views: 686
Reputation: 1305
If you call separate_rows
twice, it will work. I used str_trim
from stringr to remove whitespace that appeared before and after the author names and affiliations, and drop_na
from tidyr to remove rows that had NA for both columns.
# Loaded your .csv file as variable 'df'
authors <- df %>%
separate_rows(AF., sep = ";") %>%
separate_rows(AU., sep = ";") %>%
mutate_all(~ str_trim(., side = "both")) %>%
drop_na
# A tibble: 24,877 x 2
AF. AU.
<chr> <chr>
1 MIT Duflo, Esther
2 NBER Freeman, Richard B.
3 NBER Gelber, Alexander M.
4 NBER Freeman, Richard B.
5 NBER Gelber, Alexander M.
6 U MI Bound, John
7 U MI Lovenheim, Michael F.
8 U MI Turner, Sarah
9 Cornell U Bound, John
10 Cornell U Lovenheim, Michael F.
# … with 24,867 more rows
You can also remove rows that are duplicated with author and affiliation by using distinct
.
authors %>% distinct(AF., AU.)
# A tibble: 5,873 x 2
AF. AU.
<chr> <chr>
1 MIT Duflo, Esther
2 NBER Freeman, Richard B.
3 NBER Gelber, Alexander M.
4 U MI Bound, John
5 U MI Lovenheim, Michael F.
6 U MI Turner, Sarah
7 Cornell U Bound, John
8 Cornell U Lovenheim, Michael F.
9 Cornell U Turner, Sarah
10 U VA Bound, John
# … with 5,863 more rows
Upvotes: 2