Reputation: 47541
I have a vector with names, e.g.:
names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K."
And I want to split this per name. In this case, I need to split the vector on ".," and on the comma following "de" (that name would be A. De Jong, which is typical in Dutch).
Right now I do:
strsplit(names,split="\\.\\,|\\<de\\>,")
But this also removes the "de" from the name:
[[1]]
[1] "Jansen, A" " Karel, A" " Jong, A. " " Pietersen, K."
How can I obtain the following as result?
[[1]]
[1] "Jansen, A" " Karel, A" " Jong, A. de" " Pietersen, K."
Upvotes: 4
Views: 3362
Reputation: 121057
polishchuk's regex needs two modifications to make it work in R.
Firstly, the backslash needs escaping. Secondly, the call to strsplit needs the argument perl = TRUE to enable lookbehind.
strsplit(names, split = "\\.,|(?<=de),", perl = TRUE)
gives the answer Sacha asked for.
Note, though, that this still leaves a dot in de Jong's name, and it doesn't extend easily to alternatives like van, der, etc. I suggest the following alternative.
names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K., Helsing, A. van"
#split on every comma
first_last <- strsplit(names, split = ",")[[1]]
#rearrange into a matrix with the first column representing last names,
#and the second column representing initials
first_last <- matrix(first_last, byrow = TRUE, ncol = 2)
#clean up: remove leading spaces and dots
first_last <- gsub("^ ", "", first_last)
first_last <- gsub("\\.", "", first_last)
#combine columns again
apply(first_last, 1, paste, collapse = ", ")
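The same split-on-every-comma idea can also be used to rebuild the names in reading order (initials, particle, last name). This is only a sketch: it assumes the commas strictly alternate between last names and initials, and the variable names (authors, parts, last, first, full) are made up for illustration.

```r
# Same example data as above, stored under a hypothetical name so that
# base::names() is not shadowed.
authors <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K., Helsing, A. van"

# Split on every literal comma and strip surrounding whitespace.
parts <- trimws(strsplit(authors, split = ",", fixed = TRUE)[[1]])

# Assumption: pieces alternate last name, initials, last name, initials, ...
stopifnot(length(parts) %% 2 == 0)
last  <- parts[c(TRUE, FALSE)]   # odd positions: last names
first <- parts[c(FALSE, TRUE)]   # even positions: initials (plus particle)

# Put initials (and any particle such as "de" or "van") in front.
full <- paste(first, last)
print(full)
```

If a last name itself contains a comma, or an entry lacks initials, the alternation breaks and the stopifnot() check or the pairing will be wrong, so the input would need validating first.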
Upvotes: 5
Reputation: 47541
I just figured out a really easy workaround for this problem, which I am posting here for reference. Simply gsub the string first into something that is easier to split:
names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K."
names <- gsub("\\<de\\>,","de.,",names)
strsplit(names,split="\\.\\,")
[[1]]
[1] "Jansen, A" " Karel, A" " Jong, A. de" " Pietersen, K."
I guess this requires a separate gsub() statement for each way this can occur (in Dutch you have van, der, de, te, ten, and more), so it isn't ideal, but it gets the job done.
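One way the separate gsub() calls might be folded into a single one is to put the particles into one alternation. The sketch below assumes a hand-made particle list (particles is a hypothetical vector, not a complete list of Dutch particles) and uses authors instead of names purely to avoid shadowing base::names().

```r
# Hypothetical particle list; longer particles come first so that
# "van der" is tried before "van" and "der".
particles <- c("van der", "van", "der", "de", "te", "ten")

authors <- "Jansen, A., Jong, A. de, Berg, B. van der, Pietersen, K."

# Build one pattern: a word boundary, any particle, then a comma.
pattern <- paste0("\\b(", paste(particles, collapse = "|"), "),")

# Insert a dot after the particle so every name ends in ".,".
marked <- gsub(pattern, "\\1.,", authors, perl = TRUE)

# Split on the literal ".," and drop the leading spaces.
pieces <- trimws(strsplit(marked, split = ".,", fixed = TRUE)[[1]])
print(pieces)
```

The word boundary keeps a last name that merely ends in a particle (for example one ending in "de") from being treated as a particle, but a last name that is exactly equal to a lowercase particle would still be split in the wrong place.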
Upvotes: 1
Reputation: 56162
Try this regex, which uses look-behind: \.,|(?<=de),
It will match the separator at the end of each of these pieces:
Jansen, A.,
Karel, A.,
Jong, A. de,
Pietersen, K.
Upvotes: 3