Reputation: 23
My data frame is as follows:
User
JohnLenon03041965
RogerFederer12021954
RickLandsman01041975
and I am trying to get the output as
Name Lastname Birthdate
John Lenon 03041965
Roger Federer 12021954
Rick Landsman 01041975
I tried the following code:
a <- gsub('([[:upper:]])', ' \\1', df$User)
a <- as.data.frame(a)
library(tidyr)
a <- separate(a, a, into = c("Name", "Last"), sep = " (?=[^ ]+$)")
I get the following:
Name Last
John Lenon03041965
Roger Federer12021954
Rick Landsman01041975
I then tried to split the last name from the date with a lookahead separator such as (?=[0-9]), but I get the following error:
c <- separate(c, c, into = c("last", "date"), sep = '(?=[0-9])')
Error in if (!after) c(values, x) else if (after >= lengx) c(x, values) else c(x[1L:after], : argument is of length zero
Upvotes: 1
Views: 53
Reputation: 887951
We can use a regex lookaround as sep, specifying a split either between a lower case letter and an upper case letter, (?<=[a-z])(?=[A-Z]), or (|) between a lower case letter and a digit, (?<=[a-z])(?=[0-9]+).
library(tidyr)  # provides separate()
df1 %>%
  separate(User, into = c("Name", "LastName", "Birthdate"),
           sep = "(?<=[a-z])(?=[A-Z])|(?<=[a-z])(?=[0-9]+)")
# Name LastName Birthdate
#1 John Lenon 03041965
#2 Roger Federer 12021954
#3 Rick Landsman 01041975
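To see where that lookaround pattern matches, here is a minimal base R sketch that inserts a space at each matched position and then splits on it. It is not part of the tidyr approach above; the names users, split_pattern, spaced, parts and result are made up for illustration, and the sample values are copied from the question.

```r
# Sample values copied from the question (hypothetical vector name)
users <- c("JohnLenon03041965", "RogerFederer12021954", "RickLandsman01041975")

# Same lookaround pattern as above; it matches positions, not characters
split_pattern <- "(?<=[a-z])(?=[A-Z])|(?<=[a-z])(?=[0-9]+)"

# Insert a space at every matched position (lookarounds need perl = TRUE)
spaced <- gsub(split_pattern, " ", users, perl = TRUE)
print(spaced)

# Split on the inserted spaces and bind the pieces into a data frame
parts <- do.call(rbind, strsplit(spaced, " ", fixed = TRUE))
result <- data.frame(Name = parts[, 1],
                     LastName = parts[, 2],
                     Birthdate = parts[, 3],
                     stringsAsFactors = FALSE)
print(result)
```

Because the lookbehind and lookahead consume no characters, nothing is removed from the input; only the boundaries between a lower case letter and the next upper case letter or digit are marked.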
Or another option is extract, which captures characters as groups by placing each part of the pattern inside parentheses, (...). Here, the 1st capture group matches an upper case letter followed by one or more lower case letters, ([A-Z][a-z]+), from the start (^) of the string; the 2nd captures one or more characters that are not digits, ([^0-9]+); and the 3rd captures the rest of the characters, (.*).
df1 %>%
  extract(User, into = c("Name", "LastName", "Birthdate"),
          "^([A-Z][a-z]+)([^0-9]+)(.*)")
# Name LastName Birthdate
#1 John Lenon 03041965
#2 Roger Federer 12021954
#3 Rick Landsman 01041975
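The same three capture groups can also be tried without tidyr. The following is a rough sketch using base R's strcapture (available in R 3.4.0 and later), offered as an alternative rather than as what extract does internally; the names users, pattern, proto and res are hypothetical, and a trailing $ is added to the pattern so that it has to cover the whole string.

```r
# Sample values copied from the question (hypothetical vector name)
users <- c("JohnLenon03041965", "RogerFederer12021954", "RickLandsman01041975")

# Same capture groups as in the extract() call, anchored at both ends
pattern <- "^([A-Z][a-z]+)([^0-9]+)(.*)$"

# Prototype data frame: one column per capture group, giving names and types
proto <- data.frame(Name = character(),
                    LastName = character(),
                    Birthdate = character(),
                    stringsAsFactors = FALSE)

# strcapture() returns a data frame with one row per input string
res <- strcapture(pattern, users, proto)
print(res)
```

Birthdate is kept as character here so that leading zeros such as in 03041965 are preserved.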
Data:
df1 <- structure(list(User = c("JohnLenon03041965", "RogerFederer12021954",
    "RickLandsman01041975")), .Names = "User", class = "data.frame",
    row.names = c(NA, -3L))
Upvotes: 1