Swanny

Reputation: 189

Splitting Column Conditionally in R

My data frame looks something like this:

Var
H2307
A123
F45fjhsk
category
J30HS

And I'd like it to look like this:

Var       Var_1       Var_2
H2307     H           2307
A123      A           123
F45fjhsk  NA          NA
category  NA          NA
J30HS     J           30HS

I have tried variations of this:

for (i in 1:length(dat$Var)){
   if (nchar(dat$Var) < 7){
     tx <- strsplit(dat$Var[i], split = "(?<=[a-zA-Z])(?=[0-9])", perl = T)
     tx <- t(matrix(tx, nrow=2, ncol=length(tx)/2))
   }
 }

which I think is close but still doesn't work, although the splitting part itself works fine. I use "< 7" because every string I want to split is shorter than 7 characters, so the check excludes the "F45fjhsk" entry.

Upvotes: 1

Views: 380

Answers (2)

akuiper

Reputation: 214927

Here is one option with tidyr::extract:

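library(tidyr)
# Sample data from the question
df <- data.frame(Var = c("H2307", "A123", "F45fjhsk", "category", "J30HS"),
                 stringsAsFactors = FALSE)
df <- df %>%
    extract(Var, into = c("Var_1", "Var_2"),
            regex = "^(?=.{1,7}$)([a-zA-Z]+)([0-9].*)$", remove = FALSE)
df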

#       Var Var_1 Var_2
#1    H2307     H  2307
#2     A123     A   123
#3 F45fjhsk  <NA>  <NA>
#4 category  <NA>  <NA>
#5    J30HS     J  30HS

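^(?=.{1,7}$) asserts that the whole string is at most seven characters long (use {1,6} to require strictly fewer than seven, as in the question); ([a-zA-Z]+) matches one or more letters at the start of the string; ([0-9].*) matches from the first digit through the end of the string. Rows that do not match the pattern get NA in both new columns.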
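
If adding tidyr is not an option, the same pattern can probably be applied in base R. The following is only a minimal sketch, assuming R 3.3.0 or later (where regexec() accepts perl = TRUE) and the asker's data frame name dat; the helper pick() is a hypothetical name introduced for illustration.

```r
# Sample data from the question; `dat` is the asker's data frame name.
dat <- data.frame(
  Var = c("H2307", "A123", "F45fjhsk", "category", "J30HS"),
  stringsAsFactors = FALSE
)

# Same pattern as in the tidyr call: at most 7 characters,
# leading letters, then everything from the first digit onward.
pattern <- "^(?=.{1,7}$)([a-zA-Z]+)([0-9].*)$"

# regexec() records the full match plus each capture group;
# regmatches() returns character(0) for rows that do not match.
m <- regmatches(dat$Var, regexec(pattern, dat$Var, perl = TRUE))

# Hypothetical helper: pick capture group k (element k + 1),
# or NA when the row did not match.
pick <- function(x, k) if (length(x) == 0) NA_character_ else x[k + 1]

dat$Var_1 <- vapply(m, pick, character(1), k = 1)
dat$Var_2 <- vapply(m, pick, character(1), k = 2)
dat
```

Because the length check lives inside the regex, no separate nchar() test or loop is needed; unmatched rows simply come back as NA.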

Upvotes: 3

i-man

Reputation: 568

It looks like your regex excludes the possibility of letters in the second group. Try a two-group pattern instead:

([a-zA-Z])(.+)

Using (.+) for the second capture group lets it handle that case as well.
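
Used on its own, a pattern like this would also split entries such as "category", so it needs to be combined with the asker's length rule. The sketch below is one possible way to do that in base R, not the answerer's exact method: the anchors and the [0-9] at the start of the second group are added assumptions based on the desired output, and dat is the asker's data frame name.

```r
# Sample data from the question.
dat <- data.frame(
  Var = c("H2307", "A123", "F45fjhsk", "category", "J30HS"),
  stringsAsFactors = FALSE
)

# Anchored two-group pattern: one leading letter, then the rest.
# Assumption: the rest must start with a digit, so "category" is left alone.
pattern <- "^([a-zA-Z])([0-9].*)$"

# Rows to split: shorter than 7 characters (the asker's rule)
# and matching the pattern.
ok <- nchar(dat$Var) < 7 & grepl(pattern, dat$Var)

# sub() with a backreference keeps only the requested capture group;
# rows that fail the check get NA.
dat$Var_1 <- ifelse(ok, sub(pattern, "\\1", dat$Var), NA_character_)
dat$Var_2 <- ifelse(ok, sub(pattern, "\\2", dat$Var), NA_character_)
dat
```

Everything here is vectorized, so the for loop from the question is not needed.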

Upvotes: 2
