Sacha Epskamp

Reputation: 47541

Split string on comma following a specific word

I have a vector with names, e.g.:

names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K."

And I want to split this per name. In this case, I need to split the string on ., and on the comma following de (that name would be A. de Jong, which is typical in Dutch).

Right now I do:

 strsplit(names,split="\\.\\,|\\<de\\>,")

But this also removes the de from the name:

[[1]]
[1] "Jansen, A"      " Karel, A"      " Jong, A. "     " Pietersen, K."

How can I obtain the following as result?

[[1]]
[1] "Jansen, A"      " Karel, A"      " Jong, A. de"     " Pietersen, K."

Upvotes: 4

Views: 3362

Answers (3)

Richie Cotton

Reputation: 121057

polishchuk's regex needs two modifications to make it work in R.

Firstly, the backslash needs escaping. Secondly, the call to strsplit needs the argument perl = TRUE to enable lookbehind.

strsplit(names, split = "\\.,|(?<=de),", perl = TRUE)

gives the answer Sacha asked for.

Notice though that this still includes a dot in de Jong's name, and it isn't extensible to alternatives like van, der, etc. I suggest the following alternative.

names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K., Helsing, A. van"
# Split on every comma
first_last <- strsplit(names, split = ",")[[1]]
# Rearrange into a matrix with the first column holding last names
# and the second column holding initials (plus any particle such as de or van)
first_last <- matrix(first_last, byrow = TRUE, ncol = 2)
# Clean up: remove leading spaces and dots
first_last <- gsub("^ ", "", first_last)
first_last <- gsub("\\.", "", first_last)
# Combine the columns again
apply(first_last, 1, paste, collapse = ", ")
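If the goal is the usual Dutch written form (A. de Jong), the same matrix layout can be pasted back together in the other order. This is only a sketch, assuming every name contains exactly one comma between the last name and the initials; full_names is a name made up for this example, and trimws() is used in place of the gsub() clean-up so that the dots are kept.

```r
names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K., Helsing, A. van"

# Split on every comma and strip the surrounding whitespace
parts <- trimws(strsplit(names, split = ",")[[1]])

# Column 1: last names; column 2: initials plus any particle
first_last <- matrix(parts, byrow = TRUE, ncol = 2)

# Put initials and particle in front of the last name
full_names <- paste(first_last[, 2], first_last[, 1])
print(full_names)
# [1] "A. Jansen"      "A. Karel"       "A. de Jong"     "K. Pietersen"   "A. van Helsing"
```

This breaks down as soon as a name has no initials or an extra comma, because the matrix would then pair the wrong pieces.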

Upvotes: 5

Sacha Epskamp

Reputation: 47541

I just figured out a really easy workaround for this problem, which I am posting here for reference. Simply use gsub() first to rewrite the string into something that is easier to split:

names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K."

names <- gsub("\\<de\\>,","de.,",names)
strsplit(names,split="\\.\\,")
[[1]]
[1] "Jansen, A"      " Karel, A"      " Jong, A. de"   " Pietersen, K."

I guess this requires a separate gsub() statement for each way this can occur (in Dutch you have van, der, de, te, ten, and more), so it isn't ideal, but it gets the job done.
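A possible way to avoid one gsub() call per particle is to put the particles in a single alternation and reuse the matched one through a backreference. The sketch below assumes the particles are known in advance; the particles vector and the names pattern, marked and result are made up for this example.

```r
names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K., Helsing, A. van"

# Hypothetical list of particles; extend as needed
particles <- c("de", "van", "der", "te", "ten")

# Builds "\\b(de|van|der|te|ten)," so only whole-word particles
# directly followed by a comma are matched
pattern <- paste0("\\b(", paste(particles, collapse = "|"), "),")

# Insert a dot after the particle so that every separator is ".,"
marked <- gsub(pattern, "\\1.,", names, perl = TRUE)

# Split on ".," and drop the leading spaces
result <- trimws(strsplit(marked, split = "\\.,")[[1]])
print(result)
# [1] "Jansen, A"       "Karel, A"        "Jong, A. de"     "Pietersen, K"    "Helsing, A. van"
```

The word boundary keeps a last name that merely ends in one of the particles from being treated as a split point.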

Upvotes: 1

Kirill Polishchuk

Reputation: 56162

Try this regex, which uses a look-behind:

\.,|(?<=de),

It will match every ., and the comma following de in:

Jansen, A., Karel, A., Jong, A. de, Pietersen, K.
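To check in R which separators the pattern picks up, the matches can be listed with gregexpr() and regmatches(). This is a sketch that assumes the trailing comma is part of the pattern; the backslash is doubled for R, and perl = TRUE is needed for the look-behind. The names pattern, hits and matched are made up for this example.

```r
names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K."

# The regex from this answer, with the backslash escaped for R
pattern <- "\\.,|(?<=de),"

# Locate all matches, then extract the matched text
hits <- gregexpr(pattern, names, perl = TRUE)
matched <- regmatches(names, hits)[[1]]
print(matched)
# [1] ".," ".," "," 
```

The third match is the bare comma: the look-behind checks for de without consuming it, which is why de stays in the name after splitting.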

Upvotes: 3
