jakes

Reputation: 2085

How to extract the n-th occurrence of a pattern with regex

Let's say I have a string like this:

my_string = "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool"

And I'd like to extract the first and the second date separately with stringr.

I tried something like str_extract(my_string, '(\\d+\\.\\d+\\.\\d+){n}'), and while it works when n=1, it doesn't work with n=2. How can I extract the second occurrence?

Example of data.frame:

df <- data.frame(string_col = c("my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool", 
                                "my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd", 
                                "asdad asda-adsad KK-ASD-20.05.05-jjj"))

And I want to create columns date1, date2.

Edit:

Although @RonakShah and @ThomasIsCoding provided solutions based on str_extract_all, I'd really like to know how to do it with regex only, since finding the n-th occurrence seems to be an important pattern and may result in a much neater solution.

Upvotes: 1

Views: 681

Answers (5)

moodymudskipper

Reputation: 47320

This is a good example to showcase {unglue}.

Here you have two patterns (two dates or one date): the first is two dates separated by a dash and surrounded by anything, and the second is a single date surrounded by anything. We can write it this way:

library(unglue)

unglue_unnest(
  df, string_col, 
  c("{}{date1=\\d+\\.\\d+\\.\\d+}-{date2=\\d+\\.\\d+\\.\\d+}{}", 
    "{}{date1=\\d+\\.\\d+\\.\\d+}{}"), 
  remove = FALSE)
#>                                             string_col    date1    date2
#> 1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
#> 2  my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
#> 3                 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05     <NA>

Upvotes: 0

jan-glx

Reputation: 9496

(I) Capturing groups (marked by ()) can be repeated with {n}, but they then still count as a single capture group and hold only the last repetition. If you explicitly write down capturing groups for both dates, you can use str_match (without the "_all"):

> stringr::str_match(df$string_col, '(\\d+\\.\\d+\\.\\d+)-(\\d+\\.\\d+\\.\\d+)?')[, -1, drop = FALSE]
     [,1]       [,2]      
[1,] "19.01.03" "20.01.22"
[2,] "20.01.08" "20.04.01"
[3,] "20.05.05" NA 

Here, ? makes the occurrence of the second date optional and [, -1, drop = FALSE] removes the first column that always contains the whole match. You might want to change the - in the pattern to something more general.

To really find only the nth match, you could use (I) in an expression like this (here with n = 2):

stringr::str_match(df$string_col, paste0('(?:(\\d+\\.\\d+\\.\\d+).*){', n, '}'))[, -1]
[1] "0.01.22" "0.04.01" NA 

Here, we used (?: ) to specify a non-capturing group, such that the capture (( )) does not include what's in between the dates (.*).
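The truncated captures in the output above ("0.01.22" rather than "20.01.22") appear to come from the greedy .*, which consumes as much as it can and leaves the last repetition only the shortest tail that still looks like a date. A minimal sketch of a lazy variant follows. It assumes the dates always look like \d+\.\d+\.\d+, and extract_nth_date is a hypothetical helper name, not a stringr function:

```r
library(stringr)

df <- data.frame(string_col = c(
  "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
  "my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
  "asdad asda-adsad KK-ASD-20.05.05-jjj"
), stringsAsFactors = FALSE)

# Hypothetical helper (not part of stringr): capture the n-th date-like match.
# The lazy `.*?` skips as little as possible between repetitions, so each
# repetition starts at the next full date instead of inside a later one.
extract_nth_date <- function(x, n) {
  date_re <- "\\d+\\.\\d+\\.\\d+"
  pattern <- paste0("(?:(", date_re, ").*?){", n, "}")
  # Column 1 is the whole match; column 2 is the capture of the last repetition
  str_match(x, pattern)[, 2]
}

date1 <- extract_nth_date(df$string_col, 1)
date2 <- extract_nth_date(df$string_col, 2)
date2
#> [1] "20.01.22" "20.04.01" NA
```

Strings with fewer than n dates fail to match as a whole, so they come back as NA.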

Upvotes: 1

ThomasIsCoding

Reputation: 101373

I guess you might need str_extract_all

str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')

or regmatches if you prefer base R

regmatches(my_string,gregexpr('(\\d+\\.\\d+\\.\\d+)',my_string))

Update: with your data frame df,

transform(df,
  date = do.call(
    rbind,
    lapply(
      u <- str_extract_all(string_col, "(\\d+\\.\\d+\\.\\d+)"),
      `length<-`,
      max(lengths(u))
    )
  )
)

we will get

                                            string_col   date.1   date.2
1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
2  my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
3                 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05     <NA>
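If only the n-th date is needed rather than all of them, the regmatches approach can be wrapped in a small function. This is a sketch in base R only; nth_match is a hypothetical helper name introduced for illustration:

```r
df <- data.frame(string_col = c(
  "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
  "my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
  "asdad asda-adsad KK-ASD-20.05.05-jjj"
), stringsAsFactors = FALSE)

# Hypothetical helper: return the n-th regex match per string, or NA if absent.
nth_match <- function(x, pattern, n) {
  # One character vector of matches per input string
  matches <- regmatches(x, gregexpr(pattern, x))
  # Indexing past the end of a character vector yields NA_character_
  vapply(matches, function(m) m[n], character(1))
}

date_re <- "\\d+\\.\\d+\\.\\d+"
df$date1 <- nth_match(df$string_col, date_re, 1)
df$date2 <- nth_match(df$string_col, date_re, 2)
```

Using vapply with character(1) keeps the result a plain character vector even when some strings have no n-th match.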

Upvotes: 0

Ronak Shah

Reputation: 388982

str_extract always returns the first match. While there might be ways of altering your regex to capture the nth occurrence of a pattern, a simple way is to use str_extract_all and return the nth value.

library(stringr)
n <- 1
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')[[1]][n]
#[1] "19.01.03"
n <- 2
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')[[1]][n]
#[1] "20.01.22"

For the data frame input, we can extract all the date patterns, store them in a list column, and use unnest_wider to get them as separate columns.

library(dplyr)
df %>%
  mutate(date = str_extract_all(string_col, '\\d+\\.\\d+\\.\\d+')) %>%
  tidyr::unnest_wider(date) %>%
  rename_with(~paste0('date', seq_along(.)), starts_with('..'))

# string_col                                           date1    date2   
#  <chr>                                                <chr>    <chr>   
#1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
#2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd  20.01.08 20.04.01
#3 asdad asda-adsad KK-ASD-20.05.05-jjj                 20.05.05 NA      
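As a possible alternative that does not depend on how unnest_wider names unnamed list elements, str_extract_all can return a matrix directly via simplify = TRUE. The sketch below assumes empty strings in that matrix only ever mean "no match", which holds for this date pattern; result is just an illustrative name:

```r
library(stringr)

df <- data.frame(string_col = c(
  "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
  "my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
  "asdad asda-adsad KK-ASD-20.05.05-jjj"
), stringsAsFactors = FALSE)

# simplify = TRUE returns a character matrix, one row per input string;
# rows with fewer matches are padded with "".
m <- str_extract_all(df$string_col, "\\d+\\.\\d+\\.\\d+", simplify = TRUE)
m[m == ""] <- NA
colnames(m) <- paste0("date", seq_len(ncol(m)))

# Bind the date columns next to the original column
result <- cbind(df, as.data.frame(m, stringsAsFactors = FALSE))
```

The number of date columns follows the largest number of matches found in any row, so this adapts if some strings contain three or more dates.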

Upvotes: 0

waskuf

Reputation: 415

You could use stringr::str_extract_all() instead, like this:

str_extract_all(my_string, '\\d+\\.\\d+\\.\\d+')

Upvotes: 0
