Reputation: 69
I have a column that looks like this:
test <- c("QB Deshaun Watson RB Alvin Kamara FLEX Chris Carson RB Jamaal Williams WR Davante Adams WR Brandin Cooks WR Christian Kirk TE Darren Fells DST Browns", "QB Kyler Murray RB Alvin Kamara FLEX Chris Carson RB Jamaal Williams WR DeAndre Hopkins WR Terry McLaurin WR Marvin Jones Jr. TE David Njoku DST Saints","RB Alvin Kamara FLEX Giovani Bernard RB Jamaal Williams WR Stefon Diggs WR Keenan Allen DST Titans QB Derek Carr WR Chris Godwin TE Darren Waller")
I want to extract the name following the 2nd instance of "RB" in each row.
To get the first match, I'm using
str_extract(test, "RB\\W+\\S+\\W+\\S+")
which returns
[1] "RB Alvin Kamara" "RB Alvin Kamara" "RB Alvin Kamara"
This works for the first match, but I want the 2nd as well (only the 2nd). Any help greatly appreciated!
Upvotes: 0
Views: 41
Reputation: 160417
Similar, base-R:
gre <- gregexpr("RB\\s*(\\S+(\\s?\\S*))", test)
regmatches(test, gre)
# [[1]]
# [1] "RB Alvin Kamara" "RB Jamaal Williams"
# [[2]]
# [1] "RB Alvin Kamara" "RB Jamaal Williams"
# [[3]]
# [1] "RB Alvin Kamara" "RB Jamaal Williams"
From here, you can use sapply(...) to get the second, as in
sapply(regmatches(test, gre), `[[`, 2)
# [1] "RB Jamaal Williams" "RB Jamaal Williams" "RB Jamaal Williams"
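One caveat with the sapply call above: `[[` raises a "subscript out of bounds" error for any row that has fewer than two matches. A minimal sketch of a more forgiving variant follows; nth_match is a hypothetical helper name and rosters is made-up sample data, neither comes from the question.

```r
# Hypothetical helper (not from the answer): return the n-th regex match per
# string, or NA when a string has fewer than n matches.
nth_match <- function(x, pattern, n) {
  matches <- regmatches(x, gregexpr(pattern, x))
  # Single-bracket indexing past the end yields NA instead of an error
  vapply(matches, function(m) m[n], character(1))
}

# Made-up sample data; the second row has only one RB
rosters <- c(
  "QB Derek Carr RB Alvin Kamara FLEX Chris Carson RB Jamaal Williams",
  "QB Kyler Murray RB Alvin Kamara WR Keenan Allen"
)

second_rb <- nth_match(rosters, "RB\\s*(\\S+(\\s?\\S*))", 2)
second_rb

# Drop the position label to keep only the name
sub("^RB\\s*", "", second_rb)
```

Rows without a second RB come back as NA, so they can be filtered or counted afterward rather than stopping the whole call.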
As an aside, if you're planning on doing this programmatically for more than just the "second RB", then here's a slightly more general solution to splitting that string:
gre <- gregexpr("\\b[A-Z][A-Z]+\\b", test)
pos <- regmatches(test, gre, invert=FALSE)
str(pos)
# List of 3
# $ : chr [1:9] "QB" "RB" "FLEX" "RB" ...
# $ : chr [1:9] "QB" "RB" "FLEX" "RB" ...
# $ : chr [1:9] "RB" "FLEX" "RB" "WR" ...
name <- regmatches(test, gre, invert=TRUE)
str(name)
# List of 3
# $ : chr [1:10] "" " Deshaun Watson " " Alvin Kamara " " Chris Carson " ...
# $ : chr [1:10] "" " Kyler Murray " " Alvin Kamara " " Chris Carson " ...
# $ : chr [1:10] "" " Alvin Kamara " " Giovani Bernard " " Jamaal Williams " ...
Notice how the name vectors all have a leading ""; since we know that we're only concerned with names following the position, we can omit that and then paste everything together:
Map(paste, lapply(pos, trimws), lapply(name, function(a) trimws(a[-1])))
# [[1]]
# [1] "QB Deshaun Watson" "RB Alvin Kamara" "FLEX Chris Carson" "RB Jamaal Williams" "WR Davante Adams"
# [6] "WR Brandin Cooks" "WR Christian Kirk" "TE Darren Fells" "DST Browns"
# [[2]]
# [1] "QB Kyler Murray" "RB Alvin Kamara" "FLEX Chris Carson" "RB Jamaal Williams" "WR DeAndre Hopkins"
# [6] "WR Terry McLaurin" "WR Marvin Jones Jr." "TE David Njoku" "DST Saints"
# [[3]]
# [1] "RB Alvin Kamara" "FLEX Giovani Bernard" "RB Jamaal Williams" "WR Stefon Diggs" "WR Keenan Allen"
# [6] "DST Titans" "QB Derek Carr" "WR Chris Godwin" "TE Darren Waller"
(That has a fairly liberal use of trimws, which could be handled afterward, but this was a more surgical/specific extraction of whitespace.)
And since you wanted the nth, here's a little more.
data.frame(
  ind  = rep(seq_along(test), times = lengths(pos)),
  nth  = unlist(lapply(pos, function(z) 1 + ave(duplicated(z), z, FUN = cumsum))),
  pos  = unlist(lapply(pos, trimws)),
  name = unlist(lapply(lapply(name, `[`, -1), trimws))
)
# ind nth pos name
# 1 1 1 QB Deshaun Watson
# 2 1 1 RB Alvin Kamara
# 3 1 1 FLEX Chris Carson
# 4 1 2 RB Jamaal Williams
# 5 1 1 WR Davante Adams
# 6 1 2 WR Brandin Cooks
# 7 1 3 WR Christian Kirk
# 8 1 1 TE Darren Fells
# 9 1 1 DST Browns
# 10 2 1 QB Kyler Murray
# 11 2 1 RB Alvin Kamara
# 12 2 1 FLEX Chris Carson
# 13 2 2 RB Jamaal Williams
# 14 2 1 WR DeAndre Hopkins
# 15 2 2 WR Terry McLaurin
# 16 2 3 WR Marvin Jones Jr.
# 17 2 1 TE David Njoku
# 18 2 1 DST Saints
# 19 3 1 RB Alvin Kamara
# 20 3 1 FLEX Giovani Bernard
# 21 3 2 RB Jamaal Williams
# 22 3 1 WR Stefon Diggs
# 23 3 2 WR Keenan Allen
# 24 3 1 DST Titans
# 25 3 1 QB Derek Carr
# 26 3 3 WR Chris Godwin
# 27 3 1 TE Darren Waller
for which you can simply filter on nth and pos.
(This highlights the fact that "Marvin Jones Jr." will break any regex that relies on a first/last name pairing.)
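The filter on nth and pos mentioned above could look something like this. It is only a sketch: the input is a shortened, two-row stand-in for test, and the names players and second_rb are my own, not part of the answer.

```r
# Shortened stand-in for the question's `test` vector (assumption, for brevity)
test <- c(
  "QB Deshaun Watson RB Alvin Kamara FLEX Chris Carson RB Jamaal Williams WR Davante Adams",
  "RB Alvin Kamara FLEX Giovani Bernard RB Jamaal Williams DST Titans QB Derek Carr"
)

# Split each string into position labels and the text between them
gre  <- gregexpr("\\b[A-Z][A-Z]+\\b", test)
pos  <- regmatches(test, gre)
name <- regmatches(test, gre, invert = TRUE)

# One row per player; nth counts repeats of a position within each input string
players <- data.frame(
  ind  = rep(seq_along(test), times = lengths(pos)),
  nth  = unlist(lapply(pos, function(z) 1 + ave(duplicated(z), z, FUN = cumsum))),
  pos  = unlist(pos),
  name = unlist(lapply(name, function(a) trimws(a[-1]))),
  stringsAsFactors = FALSE
)

# Keep only the second running back from each input string
second_rb <- subset(players, pos == "RB" & nth == 2)
second_rb$name
```

Changing the condition (for example pos == "WR" & nth == 3) picks out any other slot without touching the regex.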
Upvotes: 1
Reputation: 3248
str_extract_all(test, "RB\\W+\\S+\\W+\\S+")
will return a list of all matches. If you only want the second from each, you can use
str_extract_all(test, "RB\\W+\\S+\\W+\\S+") %>% purrr::map(~ .[[2]])
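Two things to keep in mind with this approach: map returns a list rather than a character vector, and `.[[2]]` fails on any row with fewer than two matches. If only the name (without the "RB" label) is wanted, a capture group with str_match_all is one option. This is a rough sketch on made-up data; rosters and second_name are illustrative names, and the pattern still assumes a two-word name.

```r
library(stringr)

# Made-up sample data; the second row has only one RB
rosters <- c(
  "QB Derek Carr RB Alvin Kamara FLEX Chris Carson RB Jamaal Williams",
  "QB Kyler Murray RB Alvin Kamara WR Keenan Allen"
)

# One matrix per input string: column 1 is the full match,
# column 2 is the captured name
m <- str_match_all(rosters, "RB\\W+(\\S+\\W+\\S+)")

# Take the name from the second match, or NA when there is no second match
second_name <- vapply(
  m,
  function(x) if (nrow(x) >= 2) x[2, 2] else NA_character_,
  character(1)
)
second_name
```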
Upvotes: 2