Calum You

Reputation: 15072

How can I extract the unmatched portion of a string in R with regular expressions?

I have a vector of extremely messy strings. Here is an example:

library(tidyverse)
library(stringr)
strings <- tibble(
  name = c("lorem 11:07:59 86136-1-sed", 
           "ipsum 14:35:57 S VARNAME-ut",
           "dolor 10:37:53 1513 -2-perspiciatis",
           "sit 10:48:25",
           "amet 13:52:1365293-2-unde",
           "consectetur 11:53:1 16018-2-omnis",
           "adipiscing 11:19 17237-2-iste"
           )
)
strings_out <- strings %>% 
  mutate(heads = str_extract(name, "^.*?\\s\\d{1,2}:\\d{1,2}:\\d{1,2}")) %>% 
  mutate(ends = str_replace(name, "^.*?\\s\\d{1,2}:\\d{1,2}:\\d{1,2}", ""))
strings_out[,2:3]
#> # A tibble: 7 x 2
#>                 heads                          ends
#>                 <chr>                         <chr>
#> 1      lorem 11:07:59                   86136-1-sed
#> 2      ipsum 14:35:57                  S VARNAME-ut
#> 3      dolor 10:37:53          1513 -2-perspiciatis
#> 4        sit 10:48:25                              
#> 5       amet 13:52:13                  65293-2-unde
#> 6 consectetur 11:53:1                 16018-2-omnis
#> 7                <NA> adipiscing 11:19 17237-2-iste

So here I have strings that feature some text, followed by a time that may or may not be entered correctly, then some more text. I want to extract just the ends of the strings after the time, but they do not follow any pattern that I could match with a regular expression in str_extract. I can easily match the first half of each string, shown in heads. However, the only way I have found to extract the last half is to use str_replace with an empty string, as shown in ends.

I tried to include all the common errors that I noticed in this list: no consistent pattern in the hyphenation, spacing or string contents after the time; no guaranteed space between the time and the desired end half of the string; and times missing digits or even colons.

What I would like to do is to be able to use str_extract to get something close to what I got with str_replace. The key difference is that for the errors where this regex still does not work, str_extract gives me an NA that is easy to filter for and fix manually, but str_replace just copies in the whole string as seen in row 7.
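To illustrate the difference described above, here is a minimal sketch using rows 4 and 7 of the example data. The variable names `pattern`, `x`, `extracted` and `replaced` are only for illustration.

```r
library(stringr)

# Same head pattern as in the example above
pattern <- "^.*?\\s\\d{1,2}:\\d{1,2}:\\d{1,2}"

# Row 4 (matches) and row 7 (does not match: the second colon is missing)
x <- c("sit 10:48:25", "adipiscing 11:19 17237-2-iste")

# str_extract returns NA where the pattern does not match
extracted <- str_extract(x, pattern)
extracted
#> [1] "sit 10:48:25" NA

# str_replace leaves a non-matching string unchanged
replaced <- str_replace(x, pattern, "")
replaced
#> [1] ""                              "adipiscing 11:19 17237-2-iste"
```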

I suspect I could do this with some hackier methods, like collecting all the NAs and fixing them manually in Excel, but I was surprised that I could not figure out how to return the unmatched portion of a string in general, despite a lot of searching and trying different regular expressions that include (^) and [^]. Any ideas?

Upvotes: 3

Views: 508

Answers (3)

Jan

Reputation: 43169

You can have it with just one additional line:

strings["rx"] <- str_match(strings$name, "\\d*:\\d*(?::\\d+)?(.*)")[,2]
strings

Which yields

# A tibble: 7 x 2
                                 name                    rx
                                <chr>                 <chr>
1          lorem 11:07:59 86136-1-sed           86136-1-sed
2         ipsum 14:35:57 S VARNAME-ut          S VARNAME-ut
3 dolor 10:37:53 1513 -2-perspiciatis  1513 -2-perspiciatis
4                        sit 10:48:25                      
5           amet 13:52:1365293-2-unde               -2-unde
6   consectetur 11:53:1 16018-2-omnis         16018-2-omnis
7       adipiscing 11:19 17237-2-iste          17237-2-iste
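Note that the pattern above also matches row 7 and, in row 5, consumes all the digits after the second colon. If an NA for rows like row 7 is preferred, one possible variant (a sketch, not part of the original one-liner) is to reuse the stricter head pattern from the question and put a capture group after it. The name `tail_pattern` and the optional `\\s*` before the group are assumptions made for this example.

```r
library(stringr)

name <- c("lorem 11:07:59 86136-1-sed",
          "ipsum 14:35:57 S VARNAME-ut",
          "dolor 10:37:53 1513 -2-perspiciatis",
          "sit 10:48:25",
          "amet 13:52:1365293-2-unde",
          "consectetur 11:53:1 16018-2-omnis",
          "adipiscing 11:19 17237-2-iste")

# Head pattern from the question, optional whitespace, then capture the rest
tail_pattern <- "^.*?\\s\\d{1,2}:\\d{1,2}:\\d{1,2}\\s*(.*)$"

# Column 1 of str_match is the full match, column 2 is the first capture group;
# strings that do not match the pattern at all yield NA
ends <- str_match(name, tail_pattern)[, 2]
ends
#> [1] "86136-1-sed"          "S VARNAME-ut"         "1513 -2-perspiciatis"
#> [4] ""                     "65293-2-unde"         "16018-2-omnis"
#> [7] NA
```

Row 5 is still resolved greedily as 13:52:13 here, so it would need a manual check.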

Upvotes: 0

acylam

Reputation: 18681

You can also try this:

library(tidyverse)
library(stringr)

regex <- "^\\w+\\s\\d{2}:\\d{2}:*\\d{0,2}"

strings %>%
  mutate(head = str_extract(name, regex),
         end = str_replace(name, paste0(regex, "\\s?"), ""),
         end = str_replace(end, "^\\s*$", NA_character_))

Result:

# A tibble: 7 x 3
                                 name                head                  end
                                <chr>               <chr>                <chr>
1          lorem 11:07:59 86136-1-sed      lorem 11:07:59          86136-1-sed
2         ipsum 14:35:57 S VARNAME-ut      ipsum 14:35:57         S VARNAME-ut
3 dolor 10:37:53 1513 -2-perspiciatis      dolor 10:37:53 1513 -2-perspiciatis
4                        sit 10:48:25        sit 10:48:25                 <NA>
5           amet 13:52:1365293-2-unde       amet 13:52:13         65293-2-unde
6   consectetur 11:53:1 16018-2-omnis consectetur 11:53:1        16018-2-omnis
7       adipiscing 11:19 17237-2-iste    adipiscing 11:19         17237-2-iste

Note:

My solution works for row 5, but you will have to decide whether you want to extract 13:52:13 or 13:52:1 in this case. Either case can be handled with a simple modification to the regex, but as stated by @Zach, there is no automatic way to tell which one is correct.
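As a rough sketch of the two readings of row 5, the seconds field can be allowed to take up to two digits or restricted to exactly one. The patterns and the names `two_digit` and `one_digit` below are illustrative assumptions, not the regex used above.

```r
library(stringr)

x <- "amet 13:52:1365293-2-unde"

# Reading 1: seconds take up to two digits (greedy), so the time is 13:52:13
two_digit <- str_match(x, "^\\w+\\s(\\d{2}:\\d{2}:\\d{1,2})(.*)$")
two_digit[, 2:3]
#> [1] "13:52:13"     "65293-2-unde"

# Reading 2: seconds take exactly one digit, so the time is 13:52:1
one_digit <- str_match(x, "^\\w+\\s(\\d{2}:\\d{2}:\\d)(.*)$")
one_digit[, 2:3]
#> [1] "13:52:1"       "365293-2-unde"
```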

Upvotes: 1

zlipp

Reputation: 801

In general, you'll probably want to look into lookarounds, but your data might need more structure for them to be useful.

Here's a quick example I wrote before realizing the time doesn't always have a space after it:


library(tidyverse)
library(stringr)
strings <- tibble(
  name = c("lorem 11:07:59 86136-1-sed", 
           "ipsum 14:35:57 S VARNAME-ut",
           "dolor 10:37:53 1513 -2-perspiciatis",
           "sit 10:48:25",
           "amet 13:52:1365293-2-unde",
           "consectetur 11:53:1 16018-2-omnis",
           "adipiscing 11:19 17237-2-iste"
  )
)
strings_out <- strings %>% 
  mutate(heads = str_extract(name, "^.*?\\s\\d{1,2}:\\d{1,2}:\\d{1,2}"),
         ends = str_extract(name, "(?<=:\\d{1,2} )[\\s\\S]+$"))

strings_out[c(1,3)]
#> # A tibble: 7 x 2
#>                                  name                 ends
#>                                 <chr>                <chr>
#> 1          lorem 11:07:59 86136-1-sed          86136-1-sed
#> 2         ipsum 14:35:57 S VARNAME-ut         S VARNAME-ut
#> 3 dolor 10:37:53 1513 -2-perspiciatis 1513 -2-perspiciatis
#> 4                        sit 10:48:25                 <NA>
#> 5           amet 13:52:1365293-2-unde                 <NA>
#> 6   consectetur 11:53:1 16018-2-omnis        16018-2-omnis
#> 7       adipiscing 11:19 17237-2-iste         17237-2-iste

The problem here is rows like row 5. Without more structure, we can't know whether the time is 13:52:13 or 13:52:1, as both formats occur in other strings. Figuring out which is correct is not a problem that can be solved with regular expressions.
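A regex can at least flag such ambiguous rows for manual review. The sketch below assumes that three or more digits directly after the second colon indicate a time that runs into the following text. That threshold and the name `ambiguous` are assumptions for this example.

```r
library(stringr)

name <- c("lorem 11:07:59 86136-1-sed",
          "ipsum 14:35:57 S VARNAME-ut",
          "dolor 10:37:53 1513 -2-perspiciatis",
          "sit 10:48:25",
          "amet 13:52:1365293-2-unde",
          "consectetur 11:53:1 16018-2-omnis",
          "adipiscing 11:19 17237-2-iste")

# TRUE where the seconds field is followed immediately by more digits,
# so the boundary between the time and the tail cannot be determined
ambiguous <- str_detect(name, ":\\d{1,2}:\\d{3,}")
which(ambiguous)
#> [1] 5
```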

Upvotes: 1
