PavoDive

Reputation: 6496

regex in R "eats" part of the string

I want to split a character string into two groups. The string's structure is pretty simple, yet I haven't been able to make it work.

txt <- "text12-01-2016"

It's always some letters followed by a date, and the date obviously starts with a digit. I've tried the following regex at https://regex101.com/ and it separates the string properly:

([a-zA-Z]*)([0-9].*)
1. "text"
2. "12-01-2016"

But when I try in R it fails:

strsplit(a[1],split = "([a-zA-Z]*)([0-9]*)")
[[1]]
 [1] ""  " " ""  "." " " ""  " " ""  "-" ""  "-" "" 

And if I use double square brackets, it "eats" the last character of the first group and the first character of the second:

strsplit(txt,split = "([[a-zA-Z]]*)([[0-9]]*)")
[[1]]
[1] "tex"      "2-01-2016"

Using perl=TRUE makes no difference, and stringi::stri_split gives the same result, so the problem must be in my regex.

What is the correct regex to use in this case?

Upvotes: 2

Views: 70

Answers (1)

Wiktor Stribiżew

Reputation: 627082

The "problem" here is that your regex is written for matching, not for splitting: strsplit treats whatever the pattern matches as a delimiter and drops it from the result.

You can use the following PCRE regex with strsplit:

strsplit(txt, split = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE)
[[1]]
[1] "text"       "12-01-2016"

The regex matches the zero-width position between a letter and a digit, so strsplit splits the string there without consuming any characters. You can unlist the result if you need a plain character vector.
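As a minimal sketch of that last step, the split result can be flattened and the two pieces picked out by position. The variable names `parts`, `prefix`, and `date_str` are only illustrative and are not part of the original question:

```r
txt <- "text12-01-2016"

# Split at the zero-width boundary between a letter and a digit.
# Lookbehind and lookahead require the PCRE engine, hence perl = TRUE.
parts <- unlist(strsplit(txt, split = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE))

prefix   <- parts[1]  # the leading letters
date_str <- parts[2]  # the date, still a character string
```

This assumes every input contains exactly one letter-to-digit boundary, as in the example string; inputs with more such boundaries would produce more than two pieces.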

If you want to use your regex, use str_match from stringr:

library(stringr)
str_match(txt, "([a-zA-Z]*)([0-9].*)")
     [,1]             [,2]   [,3]        
[1,] "text12-01-2016" "text" "12-01-2016"
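
If you would rather avoid the stringr dependency, a base R sketch along the same lines is possible with regexec and regmatches, which also return the full match followed by the capture groups. The names `pattern`, `m`, `prefix`, and `date_str` are illustrative:

```r
txt <- "text12-01-2016"
pattern <- "([a-zA-Z]*)([0-9].*)"

# regexec() locates the full match and each capture group;
# regmatches() extracts the corresponding substrings as a list
# with one character vector per input string.
m <- regmatches(txt, regexec(pattern, txt))[[1]]

prefix   <- m[2]  # first capture group: the letters
date_str <- m[3]  # second capture group: the date
```

Here `m[1]` holds the whole match, so the groups start at index 2, mirroring the columns of the str_match output.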

Upvotes: 5
