Reputation: 339
I have a dataframe where one column is filled with character strings structured as follows: surname, given name XX, surname, given name XX, etc. The name combinations are thus divided by an "XX," at the end.
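I am looking to split each string into separate rows, one row per name, with each name rewritten as given name followed by surname. With example data, this would look as follows: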
example <- data.frame(
  id = c(1, 2, 3),
  names = c(
    "Russell-Moyle, Lloyd XX, Lucas, Caroline XX, Hobhouse, Wera XX",
    "Benn, Hilary XX, Sobel, Alex XX, West, Catherine XX, Doughty, Stephen XX",
    "Oswald, Kirsten XX, Thompson, Owen XX, Dorans, Allan XX"
  )
)
example
#current output:
#1 1 Russell-Moyle, Lloyd XX, Lucas, Caroline XX, Hobhouse, Wera XX
#2 2 Benn, Hilary XX, Sobel, Alex XX, West, Catherine XX, Doughty, Stephen XX
#3 3 Oswald, Kirsten XX, Thompson, Owen XX, Dorans, Allan XX
#ideal output:
id names
1 Lloyd Russell-Moyle
1 Caroline Lucas
1 Wera Hobhouse
2 Hilary Benn
2 Alex Sobel
2 Catherine West
2 Stephen Doughty
3 Kirsten Oswald
3 Owen Thompson
3 Allan Dorans
Could anyone help me out? Thanks!!
Upvotes: 2
Views: 79
Reputation: 5747
You can do this with some functions from the tidyr package.
library(tidyr)
library(dplyr)
example %>%
  separate_rows(names, sep = "( *)XX(,*)( *)") %>% # create one row per name
  separate(names, into = c("last", "first"), sep = ", ") %>% # separate names into first and last
  unite(names, first, last, sep = " ") # recombine as "first last"
# A tibble: 10 x 2
id names
<dbl> <chr>
1 1 Lloyd Russell-Moyle
2 1 Caroline Lucas
3 1 Wera Hobhouse
4 2 Hilary Benn
5 2 Alex Sobel
6 2 Catherine West
7 2 Stephen Doughty
8 3 Kirsten Oswald
9 3 Owen Thompson
10 3 Allan Dorans
Here is a breakdown of the regular expression in the sep = argument of separate_rows():
( *) # match a sequence starting with 0 or more spaces
XX # followed by XX
(,*) # followed by 0 or more commas
( *) # followed by 0 or more spaces
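If you want to check what the pattern matches before running the full pipeline, you can try it on a single string in base R. This is only a sketch: it uses strsplit() and sub() rather than the tidyr functions above, drops the capture groups (they are not needed for splitting), and the variable names x, parts and swapped are made up for illustration.

```r
# Base R only; no packages needed.
x <- "Russell-Moyle, Lloyd XX, Lucas, Caroline XX, Hobhouse, Wera XX"

# Split on: optional spaces, "XX", optional commas, optional spaces.
# strsplit() does not return an empty piece for the match at the end of the string.
parts <- strsplit(x, " *XX,* *")[[1]]
parts
# [1] "Russell-Moyle, Lloyd" "Lucas, Caroline"      "Hobhouse, Wera"

# Swap "surname, given name" into "given name surname".
swapped <- sub("^(.*), (.*)$", "\\2 \\1", parts)
swapped
# [1] "Lloyd Russell-Moyle" "Caroline Lucas"      "Wera Hobhouse"
```

The swap assumes every name contains exactly one ", " between surname and given name, which holds for the example data.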
Upvotes: 1