Extracting the nth occurrence after a string

I have a column that looks like this:

test <- c("QB Deshaun Watson RB Alvin Kamara FLEX Chris Carson RB Jamaal Williams WR Davante Adams WR Brandin Cooks WR Christian Kirk TE Darren Fells DST Browns", "QB Kyler Murray RB Alvin Kamara FLEX Chris Carson RB Jamaal Williams WR DeAndre Hopkins WR Terry McLaurin WR Marvin Jones Jr. TE David Njoku DST Saints","RB Alvin Kamara FLEX Giovani Bernard RB Jamaal Williams WR Stefon Diggs WR Keenan Allen DST Titans QB Derek Carr WR Chris Godwin TE Darren Waller")

I want to extract the name following the 2nd instance of "RB" in each row.

To get the first match, I'm using

str_extract(test, "RB\\W+\\S+\\W+\\S+")

which returns

[1] "RB Alvin Kamara" "RB Alvin Kamara" "RB Alvin Kamara"

This works for the first match, but I want the 2nd as well (only the 2nd). Any help greatly appreciated!

Upvotes: 0

Answers (2)

r2evans

Reputation: 160417

A similar approach in base R:

gre <- gregexpr("RB\\s*(\\S+(\\s?\\S*))", test)
regmatches(test, gre)
# [[1]]
# [1] "RB Alvin Kamara"    "RB Jamaal Williams"
# [[2]]
# [1] "RB Alvin Kamara"    "RB Jamaal Williams"
# [[3]]
# [1] "RB Alvin Kamara"    "RB Jamaal Williams"

From here, you can use sapply(...) to get the second, as in

sapply(regmatches(test, gre), `[[`, 2)
# [1] "RB Jamaal Williams" "RB Jamaal Williams" "RB Jamaal Williams"
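
One caveat with indexing by `[[`: it throws an error for any row that has fewer than two matches. A minimal sketch of a safer variant follows, assuming you would rather get NA for such rows. The helper name nth_match and the shortened sample vector x are made up for illustration and are not part of the original data.

```r
# Shortened sample input (an assumption for illustration);
# the second element contains only one "RB"
x <- c(
  "QB Deshaun Watson RB Alvin Kamara FLEX Chris Carson RB Jamaal Williams",
  "RB Alvin Kamara WR Stefon Diggs"
)

# Hypothetical helper: nth match per element, NA when there are fewer than n
nth_match <- function(x, pattern, n) {
  m <- regmatches(x, gregexpr(pattern, x))
  vapply(
    m,
    function(v) if (length(v) >= n) v[[n]] else NA_character_,
    character(1)
  )
}

res <- nth_match(x, "RB\\s*(\\S+(\\s?\\S*))", 2)
res

# Drop the leading "RB " if only the name is wanted
second_rb_name <- sub("^RB\\s+", "", res)
second_rb_name
```

Using vapply instead of sapply also guarantees a character vector back, even when the input is empty.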

As an aside, if you're planning on doing this programmatically for more than just the "second RB", then here's a slightly more general solution to splitting that string:

gre <- gregexpr("\\b[A-Z][A-Z]+\\b", test)
pos <- regmatches(test, gre, invert=FALSE)
str(pos)
# List of 3
#  $ : chr [1:9] "QB" "RB" "FLEX" "RB" ...
#  $ : chr [1:9] "QB" "RB" "FLEX" "RB" ...
#  $ : chr [1:9] "RB" "FLEX" "RB" "WR" ...
name <- regmatches(test, gre, invert=TRUE)
str(name)
# List of 3
#  $ : chr [1:10] "" " Deshaun Watson " " Alvin Kamara " " Chris Carson " ...
#  $ : chr [1:10] "" " Kyler Murray " " Alvin Kamara " " Chris Carson " ...
#  $ : chr [1:10] "" " Alvin Kamara " " Giovani Bernard " " Jamaal Williams " ...

Notice how the name vectors all have the preceding ""; since we know that we're only concerned with names following the position, we can omit that and then paste everything together:

Map(paste, lapply(pos, trimws), lapply(name, function(a) trimws(a[-1])))
# [[1]]
# [1] "QB Deshaun Watson"  "RB Alvin Kamara"    "FLEX Chris Carson"  "RB Jamaal Williams" "WR Davante Adams"  
# [6] "WR Brandin Cooks"   "WR Christian Kirk"  "TE Darren Fells"    "DST Browns"        
# [[2]]
# [1] "QB Kyler Murray"     "RB Alvin Kamara"     "FLEX Chris Carson"   "RB Jamaal Williams"  "WR DeAndre Hopkins" 
# [6] "WR Terry McLaurin"   "WR Marvin Jones Jr." "TE David Njoku"      "DST Saints"         
# [[3]]
# [1] "RB Alvin Kamara"      "FLEX Giovani Bernard" "RB Jamaal Williams"   "WR Stefon Diggs"      "WR Keenan Allen"     
# [6] "DST Titans"           "QB Derek Carr"        "WR Chris Godwin"      "TE Darren Waller"

(That makes fairly liberal use of trimws; the trimming could be done afterward in one pass, but this way the whitespace is removed exactly where it arises.)

And since you wanted the nth occurrence, here's a little more: a data frame that numbers each occurrence of a position within its row.

data.frame(
  ind = rep(seq_along(test), times = lengths(pos)),
  nth = unlist(lapply(pos, function(z) 1 + ave(duplicated(z), z, FUN = cumsum))),
  pos = unlist(lapply(pos, trimws)),
  name = unlist(lapply(lapply(name, `[`, -1), trimws))
)
#    ind nth  pos             name
# 1    1   1   QB   Deshaun Watson
# 2    1   1   RB     Alvin Kamara
# 3    1   1 FLEX     Chris Carson
# 4    1   2   RB  Jamaal Williams
# 5    1   1   WR    Davante Adams
# 6    1   2   WR    Brandin Cooks
# 7    1   3   WR   Christian Kirk
# 8    1   1   TE     Darren Fells
# 9    1   1  DST           Browns
# 10   2   1   QB     Kyler Murray
# 11   2   1   RB     Alvin Kamara
# 12   2   1 FLEX     Chris Carson
# 13   2   2   RB  Jamaal Williams
# 14   2   1   WR  DeAndre Hopkins
# 15   2   2   WR   Terry McLaurin
# 16   2   3   WR Marvin Jones Jr.
# 17   2   1   TE      David Njoku
# 18   2   1  DST           Saints
# 19   3   1   RB     Alvin Kamara
# 20   3   1 FLEX  Giovani Bernard
# 21   3   2   RB  Jamaal Williams
# 22   3   1   WR     Stefon Diggs
# 23   3   2   WR     Keenan Allen
# 24   3   1  DST           Titans
# 25   3   1   QB       Derek Carr
# 26   3   3   WR     Chris Godwin
# 27   3   1   TE    Darren Waller

for which you can simply filter on nth and pos.
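
As a sketch of that filtering step, the code below rebuilds the same kind of data frame on a shortened two-row sample and keeps only the second "RB" of each row. The names x, lineup and second_rb are made up for illustration; the real test vector can be substituted for x.

```r
# Shortened sample input (an assumption for illustration)
x <- c(
  "QB Deshaun Watson RB Alvin Kamara FLEX Chris Carson RB Jamaal Williams",
  "RB Alvin Kamara FLEX Giovani Bernard RB Jamaal Williams WR Stefon Diggs"
)

# Position codes are runs of two or more capitals; names are what lies between
gre  <- gregexpr("\\b[A-Z][A-Z]+\\b", x)
pos  <- regmatches(x, gre)
name <- regmatches(x, gre, invert = TRUE)

lineup <- data.frame(
  ind  = rep(seq_along(x), times = lengths(pos)),
  # running count of each position code within its own row
  nth  = unlist(lapply(pos, function(z) 1 + ave(duplicated(z), z, FUN = cumsum))),
  pos  = unlist(pos),
  # drop the empty string that precedes the first position code
  name = trimws(unlist(lapply(name, `[`, -1))),
  stringsAsFactors = FALSE
)

# Keep only the second RB of each row
second_rb <- lineup[lineup$pos == "RB" & lineup$nth == 2, ]
second_rb$name
```

Changing the two conditions in the subset (for example pos == "WR" and nth == 3) picks out any other occurrence without touching the regex.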

(This highlights the fact that "Marvin Jones Jr." will break any regex that relies on a first/last name pairing.)
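
If you want a single regex that copes with names of any length, one option is to match lazily up to the next position code using a lookahead. This is only a sketch: it assumes position codes are always runs of two or more capital letters and that no name contains such a run (a name like "DJ Moore" would be split incorrectly). The sample string x is made up from the rows above.

```r
# Sample input assembled for illustration (an assumption, not the full data)
x <- "RB Alvin Kamara RB Jamaal Williams WR Marvin Jones Jr. TE David Njoku DST Saints"

# A position code, then as little text as possible, stopping just before
# the next all-caps code or the end of the string.
# Lookahead (?=...) requires perl = TRUE.
pat <- "\\b[A-Z]{2,}\\b\\s+.*?(?=\\s+[A-Z]{2,}\\b|$)"

entries <- regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]]
entries

# Second RB, with the position code stripped
rb <- grep("^RB\\b", entries, value = TRUE)
second_rb <- sub("^RB\\s+", "", rb[2])
second_rb
```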

Upvotes: 1

Michael Dewar

Reputation: 3248

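library(stringr)
str_extract_all(test, "RB\\W+\\S+\\W+\\S+")

will return a list with all matches for each element. If you only want the second match from each, you can use purrr's map (note that this errors for any element with fewer than two matches):

library(purrr)
str_extract_all(test, "RB\\W+\\S+\\W+\\S+") %>% map(~ .[[2]])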

Upvotes: 2
