Reputation: 15072
I have a vector of extremely messy strings. Here is an example:
library(tidyverse)
library(stringr)
strings <- tibble(
  name = c("lorem 11:07:59 86136-1-sed",
           "ipsum 14:35:57 S VARNAME-ut",
           "dolor 10:37:53 1513 -2-perspiciatis",
           "sit 10:48:25",
           "amet 13:52:1365293-2-unde",
           "consectetur 11:53:1 16018-2-omnis",
           "adipiscing 11:19 17237-2-iste"
  )
)
strings_out <- strings %>%
  mutate(heads = str_extract(name, "^.*?\\s\\d{1,2}:\\d{1,2}:\\d{1,2}")) %>%
  mutate(ends = str_replace(name, "^.*?\\s\\d{1,2}:\\d{1,2}:\\d{1,2}", ""))
strings_out[,2:3]
#> # A tibble: 7 x 2
#>   heads               ends
#>   <chr>               <chr>
#> 1 lorem 11:07:59      86136-1-sed
#> 2 ipsum 14:35:57      S VARNAME-ut
#> 3 dolor 10:37:53      1513 -2-perspiciatis
#> 4 sit 10:48:25
#> 5 amet 13:52:13       65293-2-unde
#> 6 consectetur 11:53:1 16018-2-omnis
#> 7 <NA>                adipiscing 11:19 17237-2-iste
So here I have strings that feature some text, followed by a time that may or may not be entered correctly, then some more text. I want to extract just the ends of the strings after the time; however, they do not follow any pattern that lends itself to a regular expression with str_extract. I can easily match the first half of the strings, shown in heads. However, the only way I found to extract the last half is to use str_replace with an empty string, as shown in ends.
I tried to include all the common errors I noticed in this list: no consistent pattern in the hyphenation, spacing, or string contents after the time; no guaranteed space between the time and the desired end half of the string; and times missing digits or even colons.
What I would like to do is use str_extract to get something close to what I got with str_replace. The key difference is that for the errors where this regex still does not work, str_extract gives me an NA that is easy to filter for and fix manually, but str_replace just copies in the whole string, as seen in row 7.
I suspect I could do this with some more hacky methods, like getting all the NA values and fixing them manually in Excel or something, but I was surprised that I could not figure out how to return the unmatched portion of a string in general, despite a bunch of searching and trying different regular expressions that include (^) and [^]. Any ideas?
Upvotes: 3
Views: 508
Reputation: 43169
You can have it with just one additional line:
strings["rx"] <- str_match(strings$name, "\\d*:\\d*(?::\\d+)?(.*)")[,2]
strings
Which yields
# A tibble: 7 x 2
  name                                rx
  <chr>                               <chr>
1 lorem 11:07:59 86136-1-sed          86136-1-sed
2 ipsum 14:35:57 S VARNAME-ut         S VARNAME-ut
3 dolor 10:37:53 1513 -2-perspiciatis 1513 -2-perspiciatis
4 sit 10:48:25
5 amet 13:52:1365293-2-unde           -2-unde
6 consectetur 11:53:1 16018-2-omnis   16018-2-omnis
7 adipiscing 11:19 17237-2-iste       17237-2-iste
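A minimal sketch of how this str_match approach can separate "matched, but nothing after the time" from "no time found at all", which is the NA behaviour the question asks for. The third input string is a hypothetical row added for illustration, and needs_review is an assumed helper name. Note that the capture group starts right after the time, so it keeps any leading space.

```r
library(stringr)

name <- c("lorem 11:07:59 86136-1-sed",
          "sit 10:48:25",
          "no time here")   # hypothetical row with no time at all

# Column 1 of str_match() is the full match; column 2 is the first capture group.
m <- str_match(name, "\\d*:\\d*(?::\\d+)?(.*)")
rx <- m[, 2]

# A matched row with nothing after the time yields "",
# while a row where the pattern does not match yields NA.
needs_review <- is.na(rx) | str_trim(rx) == ""

data.frame(name, rx, needs_review)
```

Rows flagged in needs_review can then be filtered out and fixed by hand, while str_trim(rx) removes the leading space from the remaining rows.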
Upvotes: 0
Reputation: 18681
You can also try this:
library(tidyverse)
library(stringr)
regex <- "^\\w+\\s\\d{2}:\\d{2}:*\\d{0,2}"
strings %>%
  mutate(head = str_extract(name, regex),
         end = str_replace(name, paste0(regex, "\\s?"), ""),
         end = str_replace(end, "^\\s*$", NA_character_))
Result:
# A tibble: 7 x 3
  name                                head                end
  <chr>                               <chr>               <chr>
1 lorem 11:07:59 86136-1-sed          lorem 11:07:59      86136-1-sed
2 ipsum 14:35:57 S VARNAME-ut         ipsum 14:35:57      S VARNAME-ut
3 dolor 10:37:53 1513 -2-perspiciatis dolor 10:37:53      1513 -2-perspiciatis
4 sit 10:48:25                        sit 10:48:25        <NA>
5 amet 13:52:1365293-2-unde           amet 13:52:13       65293-2-unde
6 consectetur 11:53:1 16018-2-omnis   consectetur 11:53:1 16018-2-omnis
7 adipiscing 11:19 17237-2-iste       adipiscing 11:19    17237-2-iste
Note:
My solution works for row 5, but you will have to decide whether you want to extract 13:52:13 or 13:52:1 in this case. Either case can be handled with a simple modification to the regex, but as stated by @Zach, there is no automatic way to tell which one is correct.
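As a rough illustration of the two choices for row 5, the sketch below changes only the quantifier on the seconds in the regex above. The names two_digit and one_digit are made up for this example.

```r
library(stringr)

x <- "amet 13:52:1365293-2-unde"

# Variant A: seconds may take up to two digits (greedy), as in the regex above.
two_digit <- "^\\w+\\s\\d{2}:\\d{2}:*\\d{0,2}"
# Variant B: seconds take at most one digit.
one_digit <- "^\\w+\\s\\d{2}:\\d{2}:*\\d{0,1}"

head_a <- str_extract(x, two_digit)      # "amet 13:52:13"
end_a  <- str_replace(x, two_digit, "")  # "65293-2-unde"
head_b <- str_extract(x, one_digit)      # "amet 13:52:1"
end_b  <- str_replace(x, one_digit, "")  # "365293-2-unde"
```

Variant B would also cut correctly entered times such as 11:07:59 after the first seconds digit, so it only makes sense on rows already known to be malformed.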
Upvotes: 1
Reputation: 801
In general, you'll probably want to look into lookarounds, but your data might need more structure for them to be useful.
Here's a quick example I wrote before realizing the time doesn't always have a space after it:
library(tidyverse)
library(stringr)
strings <- tibble(
  name = c("lorem 11:07:59 86136-1-sed",
           "ipsum 14:35:57 S VARNAME-ut",
           "dolor 10:37:53 1513 -2-perspiciatis",
           "sit 10:48:25",
           "amet 13:52:1365293-2-unde",
           "consectetur 11:53:1 16018-2-omnis",
           "adipiscing 11:19 17237-2-iste"
  )
)
strings_out <- strings %>%
  mutate(heads = str_extract(name, "^.*?\\s\\d{1,2}:\\d{1,2}:\\d{1,2}"),
         ends = str_extract(name, "(?<=:\\d{1,2} )[\\s\\S]+$"))
strings_out[c(1,3)]
#> # A tibble: 7 x 2
#>   name                                ends
#>   <chr>                               <chr>
#> 1 lorem 11:07:59 86136-1-sed          86136-1-sed
#> 2 ipsum 14:35:57 S VARNAME-ut         S VARNAME-ut
#> 3 dolor 10:37:53 1513 -2-perspiciatis 1513 -2-perspiciatis
#> 4 sit 10:48:25                        <NA>
#> 5 amet 13:52:1365293-2-unde           <NA>
#> 6 consectetur 11:53:1 16018-2-omnis   16018-2-omnis
#> 7 adipiscing 11:19 17237-2-iste       17237-2-iste
The problem here is rows like row 5. Without more structure, we can't know whether the time is 13:52:13 or 13:52:1, as both forms are present in other strings. Figuring out which is correct is not a problem that can be solved with regular expressions.
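Although a regex cannot decide which reading is right, it can at least flag the rows where the question arises. The sketch below assumes that "ambiguous" means a colon followed by three or more digits, i.e. the seconds run straight into the next number; that definition is an assumption for this sample data, not a general rule.

```r
library(stringr)

name <- c("lorem 11:07:59 86136-1-sed",
          "ipsum 14:35:57 S VARNAME-ut",
          "dolor 10:37:53 1513 -2-perspiciatis",
          "sit 10:48:25",
          "amet 13:52:1365293-2-unde",
          "consectetur 11:53:1 16018-2-omnis",
          "adipiscing 11:19 17237-2-iste")

# Assumed rule: a colon followed by three or more digits means the time
# and the following number were typed without a separator.
ambiguous <- str_detect(name, ":\\d{3,}")

# Rows to check by hand; on this sample only row 5 is flagged.
name[ambiguous]
```

The flagged rows can be set aside for manual correction while the lookbehind extraction handles the rest.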
Upvotes: 1