user1320502

Reputation: 2570

Extract string after the first occurrence of a pattern and before another pattern

I have the following character vector:

strings <- c("David, FC; Haramey, S; Devan, IA", 
            "Colin, Matthew J.; Haramey, S",
            "Colin, Matthew")

If I want the last initials/given name for each string, I can use the following:

sub(".*, ", "", strings)
[1] "IA"      "S"       "Matthew"

This removes everything up to and including the last ", ".

However, I am stuck on how to get the first initials/given name. I know I have to remove everything before the first ", ", but then I also have to remove everything from the next space, period or semicolon onwards, if there is one.

To be clear the output I want is:

c("FC", "Matthew", "Matthew")

Any pointers would be great.

After some fiddling, I can get the first surnames (with a trailing comma) using gsub( " .*$", "", strings )

Upvotes: 2

Views: 1026

Answers (1)

Wiktor Stribiżew

Reputation: 626845

You can use

gsub("^[^\\s,]+,\\s+([^;.\\s]+).*", "\\1", strings, perl = TRUE)
[1] "FC"      "Matthew" "Matthew"

See the regex demo

Explanation:

  • ^ - start of string
  • [^\\s,]+ - 1 or more characters other than whitespace or ,
  • , - a literal comma
  • \\s+ - 1 or more whitespace characters
  • ([^;.\\s]+) - Group 1, matching 1 or more characters other than ;, . or whitespace
  • .* - zero or more characters other than a newline (the rest of the string)
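
To see what Group 1 captures on its own, the same pattern can be passed to regexec() and regmatches() instead of doing a replacement. This is only a sketch for inspection, not part of the original answer; the names pattern, m and first_given are made up for illustration.

```r
strings <- c("David, FC; Haramey, S; Devan, IA",
             "Colin, Matthew J.; Haramey, S",
             "Colin, Matthew")

# Same pattern as above (illustrative variable name)
pattern <- "^[^\\s,]+,\\s+([^;.\\s]+).*"

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into substrings, one vector per input.
m <- regmatches(strings, regexec(pattern, strings, perl = TRUE))

# In each result, element 1 is the whole match and element 2 is Group 1.
first_given <- vapply(m, function(x) x[2], character(1))
print(first_given)
```

One difference from the gsub() call: a string that does not match the pattern gives NA here, whereas gsub() returns it unchanged.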

If you prefer a POSIX-style expression that works with R's default regex engine (no perl = TRUE), replace \\s inside the bracket expressions (inside [...]) with [:blank:] (or [:space:]):

gsub( "^[^[:blank:],]+,\\s+([^;.[:blank:]]+).*", "\\1", strings)
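
As a quick sanity check, the two variants can be run side by side on the sample data from the question. This is a minimal sketch; pcre_result and posix_result are illustrative names only.

```r
strings <- c("David, FC; Haramey, S; Devan, IA",
             "Colin, Matthew J.; Haramey, S",
             "Colin, Matthew")

# PCRE variant: \\s is allowed inside the bracket expressions
pcre_result <- gsub("^[^\\s,]+,\\s+([^;.\\s]+).*", "\\1", strings, perl = TRUE)

# POSIX-style variant for the default engine: [:blank:] inside the brackets
posix_result <- gsub("^[^[:blank:],]+,\\s+([^;.[:blank:]]+).*", "\\1", strings)

# Both should give the same vector for these inputs
print(identical(pcre_result, posix_result))
```

The check only covers these three strings; inputs containing other kinds of whitespace (for example newlines) may behave differently with [:blank:] than with \\s.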

Upvotes: 5
