lukehawk

Reputation: 1493

How to use str_split with regex in R?

I have this string:

235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things

I want to split the string at the start of each 6-digit number, i.e. I want this:

235072,testing,some252f4,14084-things
224072,and,other2524,14084-thingies
223552,testing,some/2wr24,14084-things

How do I do this with regex? The following does not work (using stringr package):

> blahblah <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
> test <- str_split(blahblah, "([0-9]{6}.*)")
> test
[[1]]
[1] "" ""

What am I missing?

Upvotes: 2

Views: 3048

Answers (4)

bhakyuz

Reputation: 97

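With a simpler regex, you can locate the 6-digit numbers and then cut the string at those positions:

library(stringr)

s <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
# start/end positions of every 6-digit run
l <- str_locate_all(string = s, "[0-9]{6}")
# each piece runs from one match start to just before the next (or to the end of the string)
str_sub(string = s, start = as.data.frame(l)$start, 
    end = c(tail(as.data.frame(l)$start, -1) - 1, nchar(s)) )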
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"

Upvotes: 0

DuckPyjamas

Reputation: 1659

An easy-to-understand approach is to insert a marker before each 6-digit sequence and then split on that marker. The advantage is that the pattern only has to find the 6-digit sequences; it does not depend on anything in the surrounding text, which may change as you add new and unvetted data.

library(stringr)
library(magrittr)

str <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"

out <- 
    str_replace_all(str, "(\\d{6})", "#SPLIT_HERE#\\1") %>%  # put a marker before each 6-digit run
    str_split("#SPLIT_HERE#") %>%                             # split on the marker
    unlist

out

[1] ""                                       "235072,testing,some252f4,14084-things" 
[3] "224072,and,other2524,14084-thingies"    "223552,testing,some/2wr24,14084-things"

If a match occurs at the start or end of the string, str_split() puts an empty string in the result vector to indicate that (as it did above). If you don't need that information, you can remove it with out[nchar(out) != 0].

[1] "235072,testing,some252f4,14084-things"  "224072,and,other2524,14084-thingies"   
[3] "223552,testing,some/2wr24,14084-things"
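For comparison, here is a minimal sketch of the same marker idea in base R, in case you would rather not load stringr and magrittr. It assumes the marker text never occurs in the data; the variable names are only illustrative.

```r
# Base R sketch of the marker approach (no extra packages).
s <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"

marker <- "#SPLIT_HERE#"  # assumption: this text does not appear in the data

# Put the marker in front of every 6-digit run, keeping the digits via \\1.
marked <- gsub("([0-9]{6})", paste0(marker, "\\1"), s)

# Split on the literal marker; a match at the very start yields a leading "".
pieces <- strsplit(marked, marker, fixed = TRUE)[[1]]

# Drop the empty leading element.
pieces <- pieces[nzchar(pieces)]
print(pieces)
```

Because the marker is removed again by the split, pasting the pieces back together gives the original string.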

Upvotes: 0

Julius Vainora

Reputation: 48211

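Here's an approach with base R using a positive lookahead and lookbehind (thanks to @thelatemail for the correction). The pattern matches the empty position that has at least one character before it and six digits after it, so nothing is removed from the string:

x <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
strsplit(x, "(?<=.)(?=[0-9]{6})", perl = TRUE)[[1]]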
# [1] "235072,testing,some252f4,14084-things"  
# [2] "224072,and,other2524,14084-thingies"    
# [3] "223552,testing,some/2wr24,14084-things"
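If the number of digits might vary, the same pattern can be wrapped in a small helper. This is only a sketch: split_before_digits and its n argument are hypothetical names introduced here, not part of base R.

```r
# Hypothetical helper: split a string in front of every run of n digits,
# using the lookbehind/lookahead pattern from above.
split_before_digits <- function(x, n = 6) {
  # (?<=.) requires a preceding character, so no split happens at position 0;
  # (?=[0-9]{n}) requires n digits to follow the split point.
  pattern <- sprintf("(?<=.)(?=[0-9]{%d})", n)
  strsplit(x, pattern, perl = TRUE)[[1]]
}

x <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
parts <- split_before_digits(x)
print(parts)
```

Since both lookarounds are zero-width, no characters are consumed by the split.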

Upvotes: 5

Marius

Reputation: 60070

An alternative approach is to extract the pieces with str_extract_all instead of splitting. Note that I've used .*? to do 'non-greedy' matching; otherwise .* expands to grab everything up to the end of the string:

> str_extract_all(blahblah, "[0-9]{6}.*?(?=[0-9]{6}|$)")[[1]]
[1] "235072,testing,some252f4,14084-things"  "224072,and,other2524,14084-thingies"    "223552,testing,some/2wr24,14084-things"
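To see the difference between the greedy and non-greedy patterns, here is a small sketch in base R (regmatches with perl = TRUE rather than stringr, so it runs without extra packages). It also suggests why the str_split call in the question returned two empty strings: the greedy pattern matches the entire input, leaving nothing on either side of the "separator".

```r
blahblah <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"

# Greedy: .* runs to the end of the string, so the single match is the whole input.
greedy <- regmatches(blahblah, regexpr("[0-9]{6}.*", blahblah))
print(greedy == blahblah)  # TRUE

# Non-greedy: .*? stops as soon as the lookahead sees six digits or the end.
lazy <- regmatches(
  blahblah,
  gregexpr("[0-9]{6}.*?(?=[0-9]{6}|$)", blahblah, perl = TRUE)
)[[1]]
print(length(lazy))  # 3
```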

Upvotes: 4
