Reputation: 79
I am looking to divide a string variable into several pieces, but the substrings I'm splitting it into are of different lengths and there are no separators like . , | etc. I'm starting with a data frame like:
df <- data.frame(x=c("bigApe","smallApe","bigDog","smallDog"),y=c(1,2,5,3))
x y
bigApe 1
smallApe 2
bigDog 5
smallDog 3
And I'd like it to wind up as something like:
size anim y
1 big Ape 1
2 small Ape 2
3 big Dog 5
4 small Dog 3
I've looked at approaches using separate() that seem like they should be able to do this, but they all seem to look for either a predictable separator/white space or a fixed substring length. I can do it with a regex that looks for a capital letter, but then the matched letter is dropped:
df %>% separate(x,c("size","anim"),sep="[A-Z]")
size anim y
1 big pe 1
2 small pe 2
3 big og 5
4 small og 3
My data doesn't have any separators. I think I could add some with something in stringr, but even there everything I'm finding seems to want a specified string length. I could certainly put together a hideous for-loop, but there must be a quicker way than that!
Thanks!
Upvotes: 2
Views: 75
Reputation: 703
You can also use the base R function gsub to parse the original column using regular expression groups.
df$size <- gsub("([a-z]*)([A-Z]?[a-z]*)", "\\1", df$x)
df$animal <- gsub("([a-z]*)([A-Z]?[a-z]*)", "\\2", df$x)
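The same capture-group idea can be sketched in a single pass with base R's regexec() and regmatches(), so the pattern is only written once. This is a minimal sketch, not the answerer's method: the sample data frame is rebuilt here with the column name y, and the anchored pattern assumes every value is a lowercase run followed by one capitalised word.

```r
# Sample data mirroring the question (the column name y is an assumption)
df <- data.frame(x = c("bigApe", "smallApe", "bigDog", "smallDog"),
                 y = c(1, 2, 5, 3),
                 stringsAsFactors = FALSE)

# Group 1: lowercase run; group 2: a capital letter plus trailing lowercase
pattern <- "^([a-z]+)([A-Z][a-z]*)$"

# regexec() records the full match and each group; regmatches() extracts them
parts <- regmatches(df$x, regexec(pattern, df$x))

# Each list element is c(full match, group 1, group 2); stack them as rows
parts <- do.call(rbind, parts)

df$size <- parts[, 2]
df$anim <- parts[, 3]
df$x <- NULL
```

A value that does not fit the pattern yields an empty match, so the rows would no longer line up; it is worth checking that nrow(parts) equals nrow(df) before assigning the new columns.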
Upvotes: 1
Reputation: 155
I'm not sure you can retain the delimiter using separate... you could however use stringr::str_locate() to find the start position of the capital letter and then use substr (along with some dplyr magic):
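library(dplyr)  # provides %>%, rowwise() and mutate()

data.frame(x=c("bigApe","smallApe","bigDog","smallDog"),c(1,2,5,3), stringsAsFactors = FALSE) %>%
  rowwise() %>%  # evaluate str_locate() one row at a time
  mutate(size = substr(x, 1, stringr::str_locate(x, "[A-Z]")[1]-1),  # text before the first capital
         animal = substr(x, stringr::str_locate(x, "[A-Z]")[1], nchar(x))  # first capital onward
  )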
# A tibble: 4 x 4
# Rowwise:
x c.1..2..5..3. size animal
<chr> <dbl> <chr> <chr>
1 bigApe 1 big Ape
2 smallApe 2 small Ape
3 bigDog 5 big Dog
4 smallDog 3 small Dog
Upvotes: 1
Reputation: 7858
You need this:
df %>% separate(x,c("size","anim"), sep = "(?!^)(?=[[:upper:]])")
# A tibble: 4 x 3
size anim y
<chr> <chr> <dbl>
1 big Ape 1
2 small Ape 2
3 big Dog 5
4 small Dog 3
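Why this pattern keeps the capital letter: both parts are zero-width assertions. (?!^) rules out the very start of the string, and (?=[[:upper:]]) requires the next character to be uppercase, so the match sits between two characters and consumes nothing. A small base R sketch of that behaviour, using regexpr() with perl = TRUE instead of separate() (the vector x is just the question's sample values):

```r
x <- c("bigApe", "smallApe", "bigDog", "smallDog")

# Locate the position just before the first capital that is not at the start
pos <- regexpr("(?!^)(?=[[:upper:]])", x, perl = TRUE)

# The match is zero-width, so no characters are consumed by the split point
match_len <- attr(pos, "match.length")

# Cut each string at that position; the capital stays with the second piece
cut_at <- as.integer(pos)
size <- substr(x, 1, cut_at - 1)
anim <- substr(x, cut_at, nchar(x))
```

Here match_len is 0 for every element, which is what distinguishes this from sep="[A-Z]", where the one-character match is treated as the separator and dropped.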
Upvotes: 2