Reputation: 176

Regex in R: Return index of the last digit of the first instance of numeric characters in a string

The following are sample inputs and the desired outputs:

Input_String                |   output_col1   |   output_col2
a-123/123 Lion's park       |   a-123/123     |   Lion's park
b/11-341 lion 34 park       |   b/11-341      |   lion 34 park
flat 701 sector 4 city x    |   flat 701      |   sector 4 city x

If numbers are separated by letters, they should be treated as different numbers, and only the first occurrence should be captured in output_col1. If they are separated by punctuation, they should be treated as one single number.

Upvotes: 1

Answers (3)

G. Grothendieck

Reputation: 269481

1) gsubfn::read.pattern. This uses read.pattern and a regex with two capture groups, one for each column:

library(gsubfn)
Input <- c("a-123/123 Lion's park", "b/11-341 lion 34 park", "flat 701 sector 4 city x")

data.frame(Input, read.pattern(text = Input, pattern = "^(.*?\\d\\S+) (.*)$", quote = "",
  as.is = TRUE, col.names = c("col1", "col2")), stringsAsFactors = FALSE)

giving:

                     Input      col1            col2
1    a-123/123 Lion's park a-123/123     Lion's park
2    b/11-341 lion 34 park  b/11-341    lion 34 park
3 flat 701 sector 4 city x  flat 701 sector 4 city x

2) no packages Using the same input and regex as above:

pat <- "^(.*?\\d\\S+) (.*)$"
data.frame(Input, 
           col1 = sub(pat, "\\1", Input, perl = TRUE), 
           col2 = sub(pat, "\\2", Input, perl = TRUE),
           stringsAsFactors = FALSE)

giving the same output.
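
One thing to keep in mind with the sub() version is that a string that does not match the pattern is returned unchanged, so both columns would contain the whole input. A minimal sketch of one way to get NA instead, using base R's regexec() and regmatches(); the fourth input string is a hypothetical example added here, and perl = TRUE in regexec() assumes R 3.3.0 or later:

```r
# Same three inputs as above, plus one hypothetical string with no digits.
Input <- c("a-123/123 Lion's park", "b/11-341 lion 34 park",
           "flat 701 sector 4 city x", "no digits here")
pat <- "^(.*?\\d\\S+) (.*)$"

# regexec() records the full match and each capture group;
# regmatches() yields character(0) for strings that do not match.
m <- regmatches(Input, regexec(pat, Input, perl = TRUE))

# Keep the two capture groups, or a pair of NAs when there was no match.
parts <- t(vapply(m, function(x) {
  if (length(x) == 3L) x[2:3] else c(NA_character_, NA_character_)
}, character(2)))

res <- data.frame(Input, col1 = parts[, 1], col2 = parts[, 2],
                  stringsAsFactors = FALSE)
res
```

The first three rows should agree with the output shown above; the last row gets NA in both new columns.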

Upvotes: 0

akrun

Reputation: 887048

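We can use str_split from stringr, splitting once (n = 2) on whitespace that is preceded by a digit and followed by a letter: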

library(stringr)
df1[c("output_col1", "output_col2")] <- do.call(rbind, 
       str_split(df1$Input_string, "(?<=[0-9])\\s+(?=[A-Za-z])", n=2))
df1
#              Input_string output_col1     output_col2
#1    a-123/123 Lion's park   a-123/123     Lion's park
#2    b/11-341 lion 34 park    b/11-341    lion 34 park
#3 flat 701 sector 4 city x    flat 701 sector 4 city x

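Or, without using any external packages, insert a comma after the first number and parse the result with read.csv (this assumes the input strings contain no commas of their own):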

df2 <- cbind(df1, read.csv(text=sub("([-/ ]\\d+)\\s+", "\\1,", 
    df1$Input_string), header = FALSE, col.names = c('output_col1',
          'output_col2'), stringsAsFactors=FALSE))
df2
#              Input_string output_col1     output_col2
#1    a-123/123 Lion's park   a-123/123     Lion's park
#2    b/11-341 lion 34 park    b/11-341    lion 34 park
#3 flat 701 sector 4 city x    flat 701 sector 4 city x

data

df1 <- structure(list(Input_string = c("a-123/123 Lion's park", "b/11-341 lion 34 park", 
"flat 701 sector 4 city x")), .Names = "Input_string", row.names = c(NA, 
-3L), class = "data.frame")
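
The same lookaround split can also be sketched in base R, as an alternative to the stringr call rather than a replacement for it. regexpr() finds only the first match, and regmatches() with invert = TRUE returns the text on either side of it, which mirrors n = 2. The names x, m, parts and out are illustrative only:

```r
x <- c("a-123/123 Lion's park", "b/11-341 lion 34 park",
       "flat 701 sector 4 city x")

# First run of whitespace preceded by a digit and followed by a letter.
m <- regexpr("(?<=[0-9])\\s+(?=[A-Za-z])", x, perl = TRUE)

# invert = TRUE keeps the non-matched pieces: the text before and after the match.
parts <- regmatches(x, m, invert = TRUE)

out <- do.call(rbind, parts)
colnames(out) <- c("output_col1", "output_col2")
out
```

A string with no match would come back as a single piece, so rbind would recycle it into both columns; that case is not handled in this sketch.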

Upvotes: 1

alistaire

Reputation: 43334

tidyr::separate can make the new columns using a lookbehind and extra = "merge":

library(tidyr)

df <- structure(list(Input_String = c("a-123/123 Lion's park", "b/11-341 lion 34 park", 
    "flat 701 sector 4 city x"), output_col1 = c("a-123/123", "b/11-341", 
    "flat 701"), output_col2 = c("Lion's park", "lion 34 park", "sector 4 city x"
    )), class = "data.frame", .Names = c("Input_String", "output_col1", 
    "output_col2"), row.names = c(NA, -3L))

df %>% separate(Input_String,    # column to separate
                into = paste0('out', 1:2),    # new column names
                sep = '(?<=\\d)\\s',    # use lookbehind in separator
                extra = 'merge')    # merge extra splits into second column

#>        out1            out2 output_col1     output_col2
#> 1 a-123/123     Lion's park   a-123/123     Lion's park
#> 2  b/11-341    lion 34 park    b/11-341    lion 34 park
#> 3  flat 701 sector 4 city x    flat 701 sector 4 city x
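
For the index that the question title literally asks for (the position of the last digit of the first number), a minimal base R sketch follows. It assumes that "one single number" means digits joined only by punctuation, as described in the question; the pattern and the variable names are illustrative, not taken from any of the answers:

```r
Input <- c("a-123/123 Lion's park", "b/11-341 lion 34 park",
           "flat 701 sector 4 city x")

# A digit, optionally followed by groups of (punctuation, digit),
# so the match always ends on a digit.
m <- regexpr("\\d(?:[[:punct:]]*\\d)*", Input, perl = TRUE)

# Index of the last digit = match start + match length - 1.
end <- as.integer(m) + attr(m, "match.length") - 1L
end[as.integer(m) < 0L] <- NA_integer_   # no number found

col1 <- substr(Input, 1, end)
col2 <- sub("^\\s+", "", substring(Input, end + 1))
data.frame(Input, end, col1, col2, stringsAsFactors = FALSE)
```

For the three sample strings, end works out to 9, 8 and 8, and splitting at that index reproduces the requested columns.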

Upvotes: 0
