anpami

Reputation: 888

Using a regex in the "sep" argument of tidyr::separate_rows() does not work

I have these data:

df <- data.frame("author" = c("Kardos, NN (Fraunhofer Austria); Laflamme, NN (Fraunhofer Austria); Gallina, NN (Fraunhofer Austria); Sihn, NN (Fraunhofer Austria; TU Wien)", 
        "Demeter, NN (TU Wien; TU Wien); Derx, NN (TU Wien); Komma, NN (TU Wien); Parajka, NN (TU Wien); Schijven, NN (National Institute for Public Health and the Environment; Utrecht University); Sommer, NN (Medical University of Vienna)",
        "Prendl, NN (TU Wien); Schenzel, NN (TU Wien); Hofmann, NN (TU Wien)", 
        "Müller, NN (TU Wien); Knoll, NN (TU Wien; TU Wien); Gravogl, NN (TU Wien; University of Vienna); Jordan, NN (TU Wien); Eitenberger, NN (TU Wien); Friedbacher, NN (TU Wien); Artner, Werner (TU Wien); Welch, NN M. (TU Wien); Werner, NN (TU Wien)"
))

With a specific regex (which I got from here), I am able to extract each person. This works well:

stringr::str_extract_all(df$author, "\\w+,\\s*\\w+\\s*\\([^()]*(?:\\([^()]*\\)[^()]*)*\\);?")

However, the same regex does not work when I use tidyr::separate_rows():

tidyr::separate_rows(df, author, sep = "\\w+,\\s*\\w+\\s*\\([^()]*(?:\\([^()]*\\)[^()]*)*\\);?")

How come? What is the issue here, and how can I use that regex with separate_rows()?

Upvotes: 1

Views: 234

Answers (2)

Wiktor Stribiżew

Reputation: 626932

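The point here is that a regex used for extraction must match the text you want to keep, whereas a regex used in a splitting function must match the delimiters: the function removes every match and splits the original string at the locations of those matches. Your pattern matches each whole person entry, so using it as sep throws away exactly the text you want.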
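
To see the difference concretely, here is a minimal sketch in base R. It rests on two assumptions: base regmatches() and strsplit() with perl = TRUE stand in for the stringr and tidyr functions, and the shortened input string is made up for illustration.

```r
# Shortened, made-up input in the same shape as the question's data
x <- "Kardos, NN (A); Sihn, NN (A; B)"
pattern <- "\\w+,\\s*\\w+\\s*\\([^()]*(?:\\([^()]*\\)[^()]*)*\\);?"

# Extracting keeps the text the pattern matches: one element per person
extracted <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Splitting discards the matches and keeps only what lies between them
pieces <- strsplit(x, pattern, perl = TRUE)[[1]]

print(extracted)
print(pieces)
```

Here extracted holds the two person entries, while pieces contains no names at all, only the leftover text between the matches.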

You can use

tidyr::separate_rows(df, author, sep = "(?<=\\));\\s*")

Details

  • (?<=\)) - a positive lookbehind: a location immediately preceded by a literal ) character, which itself is not consumed
  • ; - a semicolon
  • \s* - zero or more whitespace characters.

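separate_rows() finds these matches, removes the matched text, and splits the original strings at those positions. Since the ) is only checked by the lookbehind and not matched, it stays in the output, and semicolons inside the parentheses are not preceded by ) and therefore do not trigger a split.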
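
A small sketch of that split on a single string, assuming base strsplit() with perl = TRUE as a dependency-free stand-in (separate_rows() may use a different regex engine internally, and the input string is invented for illustration):

```r
# Made-up input: the second entry has a semicolon inside the parentheses
x <- "Kardos, NN (A); Sihn, NN (A; B)"

# Split only at a semicolon that directly follows a closing parenthesis
parts <- strsplit(x, "(?<=\\));\\s*", perl = TRUE)[[1]]

print(parts)
```

The semicolon inside "(A; B)" follows a letter rather than a ), so that entry stays in one piece.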

Upvotes: 2

Ronak Shah

Reputation: 389055

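One way would be to extract the values first and then repeat each row of df as many times as it has extracted values. Note that because the pattern ends in ;?, each extracted value keeps its trailing semicolon where one was present.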

values <- stringr::str_extract_all(df$author, "\\w+,\\s*\\w+\\s*\\([^()]*(?:\\([^()]*\\)[^()]*)*\\);?")

result <- transform(df[rep(seq_len(nrow(df)), lengths(values)), , drop = FALSE], author = unlist(values))

Upvotes: 1
