Reputation: 343
I have a data frame that looks more or less like this:
name_position
RAHEEM STERLINGForward
MARCUS RASHFORDForward
JORDAN HENDERSONMidfielder
JORDAN PICKFORDGoalkeeper
KYLE WALKERDefender
My goal is to split this column into two, so I've created a vector containing all the available positions:
positions <- c("Goalkeeper", "Defender", "Midfielder", "Forward")
I've tried functions such as separate(), extract(), and even str_match(), but I haven't been able to get the output I want, which would look like this:
name position
RAHEEM STERLING Forward
MARCUS RASHFORD Forward
JORDAN HENDERSON Midfielder
JORDAN PICKFORD Goalkeeper
KYLE WALKER Defender
Upvotes: 2
Views: 46
Reputation: 21400
Use str_extract() from stringr:
library(stringr)
df1$position <- str_extract(df1$name_position, "(?<=[A-Z])[A-Z][a-z]+")
Result:
df1
name_position position
1 RAHEEM STERLINGForward Forward
2 MARCUS RASHFORDForward Forward
3 JORDAN HENDERSONMidfielder Midfielder
4 JORDAN PICKFORDGoalkeeper Goalkeeper
5 KYLE WALKERDefender Defender
This solution uses a positive lookbehind:
(?<=[A-Z]) : if there is an upper-case letter immediately to the left ...
[A-Z][a-z]+ : ... match the next upper-case letter plus the one or more lower-case letters that follow it
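The str_extract() call above adds only the position column. As a rough sketch of how the same pattern could also produce the name column, the match can be removed with str_remove(). The trailing $ anchor and the two-row sample are additions for illustration, and the sketch assumes names are written entirely in upper case, as in the question:

```r
library(stringr)

# Two rows copied from the question's data, for illustration only
df1 <- data.frame(
  name_position = c("RAHEEM STERLINGForward", "KYLE WALKERDefender"),
  stringsAsFactors = FALSE
)

# Same lookbehind pattern as in the answer; the "$" anchor is an added
# assumption that the position always sits at the end of the string
pattern <- "(?<=[A-Z])[A-Z][a-z]+$"

# The lookbehind is zero-width, so the last letter of the name is kept
df1$position <- str_extract(df1$name_position, pattern)
df1$name <- str_remove(df1$name_position, pattern)
```

Because the lookbehind consumes no characters, removing the match leaves the full name intact. A name containing lower-case letters (for example "McTOMINAY") could break the all-caps assumption.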
Upvotes: 2
Reputation: 886938
We can use separate() with regex lookarounds:
library(dplyr)
library(tidyr)
df1 %>%
  separate(name_position, into = c("name", "position"),
           sep = "(?<=[A-Z])(?=[A-Z][a-z])")
# name position
#1 RAHEEM STERLING Forward
#2 MARCUS RASHFORD Forward
#3 JORDAN HENDERSON Midfielder
#4 JORDAN PICKFORD Goalkeeper
#5 KYLE WALKER Defender
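The separate() call splits at the zero-width point between an upper-case letter and an upper-case letter followed by a lower-case one. The question also mentions extract(); a possible equivalent with two capture groups is sketched below. The regex and the two-row sample are illustrative assumptions, not part of the original answer:

```r
library(tidyr)

# Two rows copied from the question's data, for illustration only
df1 <- data.frame(
  name_position = c("JORDAN HENDERSONMidfielder", "JORDAN PICKFORDGoalkeeper"),
  stringsAsFactors = FALSE
)

# Group 1: everything up to and including the last upper-case letter of the name
# Group 2: one upper-case letter followed by lower-case letters, at the end
out <- tidyr::extract(
  df1, name_position,
  into  = c("name", "position"),
  regex = "^(.*[A-Z])([A-Z][a-z]+)$"
)
```

Rows that do not match the regex would get NA in both new columns, which can make malformed input easier to spot than a silent bad split.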
If we have a custom vector of positions, one option is to collapse it into a single regex pattern (pat) joined with |:
library(stringr)
pat <- str_c(positions, collapse="|")
df1 %>%
  transmute(name = str_remove(name_position, pat),
            position = str_extract(name_position, pat))
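The alternation pattern above matches a position anywhere in the string. If that is a concern, one possible variant anchors the alternatives to the end of the string. This is only a sketch: the name pat_end, the non-capturing group, and the three-row sample are assumptions made for illustration:

```r
library(stringr)

positions <- c("Goalkeeper", "Defender", "Midfielder", "Forward")

# Three rows copied from the question's data, for illustration only
df1 <- data.frame(
  name_position = c("RAHEEM STERLINGForward",
                    "JORDAN PICKFORDGoalkeeper",
                    "KYLE WALKERDefender"),
  stringsAsFactors = FALSE
)

# Wrap the alternatives in a non-capturing group so "$" applies to all of them
pat_end <- str_c("(?:", str_c(positions, collapse = "|"), ")$")

result <- data.frame(
  name     = str_remove(df1$name_position, pat_end),
  position = str_extract(df1$name_position, pat_end),
  stringsAsFactors = FALSE
)
```

Without the group, the $ would bind only to the last alternative ("Forward"), so the grouping matters more than it might appear.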
Data used in the examples:
df1 <- structure(list(name_position = c("RAHEEM STERLINGForward", "MARCUS RASHFORDForward",
                                        "JORDAN HENDERSONMidfielder", "JORDAN PICKFORDGoalkeeper",
                                        "KYLE WALKERDefender"
)), class = "data.frame", row.names = c(NA, -5L))
Upvotes: 2