Reputation: 888
I have these data:
df <- data.frame("author" = c("Kardos, NN (Fraunhofer Austria); Laflamme, NN (Fraunhofer Austria); Gallina, NN (Fraunhofer Austria); Sihn, NN (Fraunhofer Austria; TU Wien)",
"Demeter, NN (TU Wien; TU Wien); Derx, NN (TU Wien); Komma, NN (TU Wien); Parajka, NN (TU Wien); Schijven, NN (National Institute for Public Health and the Environment; Utrecht University); Sommer, NN (Medical University of Vienna)",
"Prendl, NN (TU Wien); Schenzel, NN (TU Wien); Hofmann, NN (TU Wien)",
"Müller, NN (TU Wien); Knoll, NN (TU Wien; TU Wien); Gravogl, NN (TU Wien; University of Vienna); Jordan, NN (TU Wien); Eitenberger, NN (TU Wien); Friedbacher, NN (TU Wien); Artner, Werner (TU Wien); Welch, NN M. (TU Wien); Werner, NN (TU Wien)"
))
With a specific regex (which I got from here), I am able to extract each person. This works well:
stringr::str_extract_all(df$author, "\\w+,\\s*\\w+\\s*\\([^()]*(?:\\([^()]*\\)[^()]*)*\\);?")
However, the same regex does not work when I use tidyr::separate_rows():
tidyr::separate_rows(df, author, sep = "\\w+,\\s*\\w+\\s*\\([^()]*(?:\\([^()]*\\)[^()]*)*\\);?")
How come? What is the issue here? How can I use that regex with separate_rows()?
Upvotes: 1
Views: 234
Reputation: 626932
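The point here is that a regex used for extraction matches the text you want to keep, whereas a regex used in a splitting function matches the text you want to discard: the function removes the matches and splits the original string at the locations where they occurred. Since your pattern matches each whole person, splitting on it removes every person and leaves only the whitespace between them. For splitting, the pattern has to match the delimiter instead.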
You can use
tidyr::separate_rows(df, author, sep = "(?<=\\));\\s*")
See the regex demo
Details
(?<=\)) - a location immediately preceded with )
; - a semi-colon
\s* - zero or more whitespaces.
These matches are found, and separate_rows will split the original strings in the places where the matches occur while removing the matched text.
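To make the difference concrete, here is a minimal sketch that runs both patterns on a shortened string taken from the question's data. It uses stringr::str_split() in place of separate_rows() purely for illustration, on the assumption that both treat the pattern as a delimiter in the same way. The variable names are made up for this example.

```r
library(stringr)

# Shortened sample taken from the first row of the question's data
x <- "Gallina, NN (Fraunhofer Austria); Sihn, NN (Fraunhofer Austria; TU Wien)"

# The question's pattern: matches one whole person, including a trailing ";"
person <- "\\w+,\\s*\\w+\\s*\\([^()]*(?:\\([^()]*\\)[^()]*)*\\);?"

# Extraction keeps the matched text, so each person is returned
extracted <- str_extract_all(x, person)[[1]]

# Splitting removes the matched text, so only the gaps between persons remain
split_on_person <- str_split(x, person)[[1]]

# Splitting on the delimiter (a ";" right after ")") keeps the persons intact;
# the ";" inside "(Fraunhofer Austria; TU Wien)" is not preceded by ")" and is left alone
split_on_delim <- str_split(x, "(?<=\\));\\s*")[[1]]

print(extracted)
print(split_on_person)
print(split_on_delim)
```

The second result contains nothing but empty or whitespace-only pieces, which is why the original call appeared not to work.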
Upvotes: 2
Reputation: 389055
One way would be to repeat the rows of df by the lengths of the extracted values.
values <- stringr::str_extract_all(df$author, "\\w+,\\s*\\w+\\s*\\([^()]*(?:\\([^()]*\\)[^()]*)*\\);?")
result <- transform(df[rep(seq(nrow(df)), lengths(values)), ], author = unlist(values))
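Note that the pattern ends in ;? so the extracted values keep their trailing semicolon. Below is a small sketch of the same row-repetition idea that also strips it. The two-row data frame and its id column are hypothetical and only there to show that the other columns get repeated alongside; stringsAsFactors = FALSE is set in case an older R version is in use.

```r
library(stringr)

# Hypothetical two-row example; "id" stands in for any other columns in df
df_small <- data.frame(
  id = 1:2,
  author = c("Prendl, NN (TU Wien); Schenzel, NN (TU Wien)",
             "Sihn, NN (Fraunhofer Austria; TU Wien)"),
  stringsAsFactors = FALSE
)

# The question's pattern, matching one person at a time
pattern <- "\\w+,\\s*\\w+\\s*\\([^()]*(?:\\([^()]*\\)[^()]*)*\\);?"
values <- str_extract_all(df_small$author, pattern)

# Repeat each row once per extracted person
result <- df_small[rep(seq_len(nrow(df_small)), lengths(values)), , drop = FALSE]

# Replace the author column with the individual persons, minus the trailing ";"
result$author <- sub(";$", "", unlist(values))
rownames(result) <- NULL

print(result)
```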
Upvotes: 1