Michael Clauss

Reputation: 79

Using separate() to split differently-sized strings

I am looking to divide a string variable into several pieces, but the substrings I'm splitting it into are of different lengths and there are no separators like . , | etc. So I'm starting with a data frame like:

df <- data.frame(x=c("bigApe","smallApe","bigDog","smallDog"),y=c(1,2,5,3))
x         y
bigApe    1
smallApe  2
bigDog    5
smallDog  3

And I'd like it to wind up as something like:

  size  anim  y
1 big   Ape   1
2 small Ape   2
3 big   Dog   5
4 small Dog   3

I've looked at approaches using separate() that seem like they should be able to do this, but they all seem to look for either a predictable separator/white space or a set substring length. I can do it with a regex that looks for a capital letter, but then it doesn't keep the letter:

df %>% separate(x,c("size","anim"),sep="[A-Z]")
   size anim   y
1   big   pe   1
2 small   pe   2
3   big   og   5
4 small   og   3

That's not the output I'm after, since the capital letter is dropped. I think I could insert a separator with something in stringr, but even there everything I'm finding seems to want a specified string length. I could certainly put together a hideous for-loop, but there must be a quicker way than that!

Thanks!

Upvotes: 2

Views: 75

Answers (3)

Dave Ross

Reputation: 703

You can also use the base R function gsub() to pull each piece out of the original column with regular-expression capture groups.

df$size <- gsub("([a-z]*)([A-Z]?[a-z]*)", "\\1", df$x)
df$animal <- gsub("([a-z]*)([A-Z]?[a-z]*)", "\\2", df$x)
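A variation on the same idea, as a minimal sketch rather than a definitive version: anchoring the pattern with ^ and $ and using sub() makes it explicit that each string is matched exactly once. The pattern below assumes every value is a run of lowercase letters followed by a part that starts with a capital; the variable name pattern and the explicit y column name are my own choices, not part of the original answer.

```r
# Example data from the question (the second column is named y here).
df <- data.frame(x = c("bigApe", "smallApe", "bigDog", "smallDog"),
                 y = c(1, 2, 5, 3),
                 stringsAsFactors = FALSE)

# Group 1: leading lowercase run. Group 2: first capital letter to the end.
pattern <- "^([a-z]+)([A-Z].*)$"

df$size   <- sub(pattern, "\\1", df$x)  # keep only the first capture group
df$animal <- sub(pattern, "\\2", df$x)  # keep only the second capture group

print(df[, c("size", "animal", "y")])
```

Strings that do not fit the assumed shape (for example, no capital letter at all) are returned unchanged by sub(), so it is worth checking for those before relying on the result.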

Upvotes: 1

Mitchell Graham

Reputation: 155

I'm not sure you can retain the delimiter using separate(). You could, however, use stringr::str_locate() to find the position of the capital letter and then use substr() (along with some dplyr magic):

library(dplyr)

data.frame(x=c("bigApe","smallApe","bigDog","smallDog"),c(1,2,5,3), stringsAsFactors = FALSE) %>%
  rowwise() %>%  # evaluate str_locate() one row at a time
  mutate(size = substr(x, 1,stringr::str_locate(x, "[A-Z]")[1]-1),          # text before the first capital
         animal = substr(x, stringr::str_locate(x, "[A-Z]")[1], nchar(x))   # first capital onward
  )

# A tibble: 4 x 4
# Rowwise: 
  x        c.1..2..5..3. size  animal
  <chr>            <dbl> <chr> <chr> 
1 bigApe               1 big   Ape   
2 smallApe             2 small Ape   
3 bigDog               5 big   Dog   
4 smallDog             3 small Dog  
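The same locate-then-cut idea can also be sketched in base R without rowwise(), since regexpr() and substr() both work on whole vectors. This is only an illustration under the assumption that every string contains at least one capital letter; the names x, pos, size and animal are mine.

```r
x <- c("bigApe", "smallApe", "bigDog", "smallDog")

# Position of the first capital letter in each string (-1 where there is none).
pos <- as.integer(regexpr("[A-Z]", x))

size   <- substr(x, 1, pos - 1)      # everything before the capital
animal <- substr(x, pos, nchar(x))   # the capital and everything after it

print(data.frame(size, animal, y = c(1, 2, 5, 3)))
```

A string with no capital gives pos = -1, so such rows would need separate handling.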

Upvotes: 1

Edo

Reputation: 7858

You need this:

df %>% separate(x,c("size","anim"), sep = "(?!^)(?=[[:upper:]])")
# A tibble: 4 x 3
  size  anim      y
  <chr> <chr> <dbl>
1 big   Ape       1
2 small Ape       2
3 big   Dog       5
4 small Dog       3
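Why this keeps the capital: the pattern is zero-width. (?!^) says "not at the start of the string" and (?=[[:upper:]]) says "the next character is uppercase", so the split point sits between two characters and nothing is consumed. As a rough base R illustration of the same idea (not the code above), the sketch below uses strsplit() with perl = TRUE and a slightly different pattern, a lookbehind for a lowercase letter plus a lookahead for an uppercase one. That swap, and the names x, parts and out, are my own assumptions.

```r
x <- c("bigApe", "smallApe", "bigDog", "smallDog")

# Split at the empty position between a lowercase and an uppercase letter.
# Both lookarounds are zero-width, so no characters are removed.
parts <- strsplit(x, "(?<=[a-z])(?=[A-Z])", perl = TRUE)

# Bind the two-element pieces into a character matrix with named columns.
out <- do.call(rbind, parts)
colnames(out) <- c("size", "anim")
print(out)
```

This assumes each string has exactly one lowercase-to-uppercase boundary; strings with more boundaries would produce more than two pieces.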

Upvotes: 2
