Reputation: 2085
Let's say I have a string like this:
my_string = "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool"
And I'd like to extract the first and the second date separately with stringr.
I tried something like str_extract(my_string, '(\\d+\\.\\d+\\.\\d+){n}'), and while it works when n=1, it doesn't work with n=2. How can I extract the second occurrence?
Example of a data.frame:
df <- data.frame(string_col = c("my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
"my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
"asdad asda-adsad KK-ASD-20.05.05-jjj"))
And I want to create columns date1, date2.
Although @RonanShah and @ThomasIsCoding provided solutions based on str_extract_all, I'd really like to know how to do it using regex only, since finding the n-th occurrence seems to be an important pattern and may result in a much neater solution.
Upvotes: 1
Reputation: 47320
This is a good example to showcase {unglue}.
Here you have 2 patterns (one date or two dates): the first is two dates separated by a dash and surrounded by anything, the second is a single date surrounded by anything. We can write it this way:
library(unglue)
unglue_unnest(
df, string_col,
c("{}{date1=\\d+\\.\\d+\\.\\d+}-{date2=\\d+\\.\\d+\\.\\d+}{}",
"{}{date1=\\d+\\.\\d+\\.\\d+}{}"),
remove = FALSE)
#> string_col date1 date2
#> 1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
#> 2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
#> 3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 <NA>
Upvotes: 0
Reputation: 9496
(I) Capturing groups (marked by ()) can be multiplied by {n}, but will then count only as one capture group and match the last instance. If you explicitly write down capturing groups for both dates, you can use str_match (without the "_all"):
> stringr::str_match(df$string_col, '(\\d+\\.\\d+\\.\\d+)-(\\d+\\.\\d+\\.\\d+)?')[, -1, drop = FALSE]
[,1] [,2]
[1,] "19.01.03" "20.01.22"
[2,] "20.01.08" "20.04.01"
[3,] "20.05.05" NA
Here, ? makes the occurrence of the second date optional and [, -1, drop = FALSE] removes the first column, which always contains the whole match. You might want to change the - in the pattern to something more general.
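As a sketch of how the two-group idea could fill the date1 and date2 columns directly, the example below rebuilds df from the question and swaps the literal - for a lazy gap (.*?) inside an optional non-capturing group. The names date_rx and m are illustrative assumptions, not part of the original answer.

```r
library(stringr)

df <- data.frame(string_col = c(
  "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
  "my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
  "asdad asda-adsad KK-ASD-20.05.05-jjj"
))

# Illustrative name: the date pattern used throughout this thread
date_rx <- "\\d+\\.\\d+\\.\\d+"

# Group 1: first date; group 2 (optional): next date after an arbitrary lazy gap
m <- str_match(df$string_col, paste0("(", date_rx, ")(?:.*?(", date_rx, "))?"))

# Column 1 of m is the whole match; columns 2 and 3 are the capture groups
df$date1 <- m[, 2]
df$date2 <- m[, 3]
df
```

Rows with only one date get NA in date2, because the optional group simply does not participate in the match.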
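To really find only the nth match, you could use (I) in an expression like this (shown for n = 2):
n <- 2
stringr::str_match(df$string_col, paste0('(?:(\\d+\\.\\d+\\.\\d+).*?){', n, '}'))[, -1]
[1] "20.01.22" "20.04.01" NA
Here, we used (?: ) to specify a non-capturing group, such that the capture ( ( ) ) does not include what's in between dates (.*?). The gap is lazy (.*? rather than .*) because a greedy .* swallows the leading digit of the last date and returns "0.01.22" instead of "20.01.22".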
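The repeated-group idea can be wrapped in a small helper. The sketch below assumes a hypothetical function name, nth_match, and uses a lazy gap (.*?) between repetitions so that the gap cannot consume leading digits of the n-th date.

```r
library(stringr)

# nth_match() is a hypothetical helper name, not part of stringr.
nth_match <- function(x, pattern, n) {
  # Repeat "(pattern) followed by a lazy gap" n times;
  # the capture group keeps only the last repetition.
  rx <- paste0("(?:(", pattern, ").*?){", n, "}")
  # Column 1 is the whole match, column 2 is the captured n-th occurrence
  str_match(x, rx)[, 2]
}

strings <- c(
  "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
  "my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
  "asdad asda-adsad KK-ASD-20.05.05-jjj"
)
date_rx <- "\\d+\\.\\d+\\.\\d+"

first_dates  <- nth_match(strings, date_rx, 1)
second_dates <- nth_match(strings, date_rx, 2)
```

Strings with fewer than n occurrences yield NA, so the result can be assigned straight to a data frame column.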
Upvotes: 1
Reputation: 101373
I guess you might need str_extract_all:
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')
or regmatches, if you prefer base R:
regmatches(my_string,gregexpr('(\\d+\\.\\d+\\.\\d+)',my_string))
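To pick out just the n-th match per string with base R, one possible sketch is to index each element of the list that regmatches returns. The names strings, all_dates and nth_date are illustrative assumptions.

```r
strings <- c(
  "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
  "my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
  "asdad asda-adsad KK-ASD-20.05.05-jjj"
)
date_rx <- "\\d+\\.\\d+\\.\\d+"

# One character vector of matches per input string
all_dates <- regmatches(strings, gregexpr(date_rx, strings))

# Indexing past the end of a character vector gives NA,
# so strings with fewer than n dates produce NA here
n <- 2
nth_date <- vapply(all_dates, function(d) d[n], character(1))
```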
Update
With your data frame df
transform(df,
  date = do.call(
    rbind,
    lapply(
      u <- str_extract_all(string_col, "(\\d+\\.\\d+\\.\\d+)"),
      `length<-`,
      max(lengths(u))
    )
  )
)
we will get
string_col date.1 date.2
1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 <NA>
Upvotes: 0
Reputation: 388982
str_extract always returns the first match. While there might be ways of altering your regex to capture the nth occurrence of a pattern, a simple way is to use str_extract_all and return the nth value.
library(stringr)
n <- 1
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')[[1]][n]
#[1] "19.01.03"
n <- 2
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')[[1]][n]
#[1] "20.01.22"
For the dataframe input, we can extract all the date patterns, store them in a list, and use unnest_wider to get them as separate columns.
library(dplyr)
df %>%
  mutate(date = str_extract_all(string_col, '\\d+\\.\\d+\\.\\d+')) %>%
  tidyr::unnest_wider(date) %>%
  rename_with(~paste0('date', seq_along(.)), starts_with('..'))
# string_col date1 date2
# <chr> <chr> <chr>
#1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
#2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
#3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 NA
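If relying on the automatically generated column names is a concern, a possible alternative sketch is to ask str_extract_all for a matrix with simplify = TRUE and name the columns explicitly. This assumes that shorter rows are padded with empty strings, which are converted to NA here; m and res are illustrative names.

```r
library(stringr)

df <- data.frame(string_col = c(
  "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
  "my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
  "asdad asda-adsad KK-ASD-20.05.05-jjj"
))

# Character matrix: one row per string, one column per match position
m <- str_extract_all(df$string_col, "\\d+\\.\\d+\\.\\d+", simplify = TRUE)

# Rows with fewer matches are padded with ""; turn the padding into NA
m[m == ""] <- NA

# Name the columns date1, date2, ... and bind them to the original data
colnames(m) <- paste0("date", seq_len(ncol(m)))
res <- cbind(df, as.data.frame(m, stringsAsFactors = FALSE))
res
```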
Upvotes: 0
Reputation: 415
You could use stringr::str_extract_all() instead, like this:
str_extract_all(my_string, '\\d+\\.\\d+\\.\\d+')
Upvotes: 0