Reputation: 691
I have a list of birthdays that look something like this:
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")
I want to grab just the calendar date from this variable (i.e., drop everything after the first occurrence of white space).
Here's what I have tried so far:
dob.abridged <- substring(dob,1,8)
> dob.abridged
[1] "9/9/43 1" "9/17/88 " "11/21/48"
dob.abridged <- gsub(" $","", dob.abridged, perl=T)
> dob.abridged
[1] "9/9/43 1" "9/17/88" "11/21/48"
So my code works for calendar dates of length 7 or 8, but not length 6. Any pointers on a more effective regex to use with gsub that can handle calendar dates of length 6, 7 or 8?
Thank you.
Upvotes: 65
Views: 103063
Reputation: 53
Another way to extract the alphabetic characters that come before a white space uses the stringr package, which has to be installed first:
stringr::str_extract(c("juan carlos", "miguel angel"), stringr::regex(pattern = "[a-z]+(?=\\s)", ignore_case = FALSE))
[a-z]: matches every character between a and z (in Unicode code point order).
+: 1 or more.
(?=\\s): lookahead; the match must be followed by \s (white space), which is not itself part of the match.
More info: https://stringr.tidyverse.org/articles/regular-expressions.html
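The same lookahead idea can be applied to the dates from the question without stringr. This is only a sketch in base R: the pattern `^\\S+(?=\\s)` and the names `m` and `dates` are my own choices for illustration, not something taken from the answer above.

```r
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")

# perl = TRUE is needed because lookahead is a Perl-style regex feature.
# ^\\S+   : one or more non-whitespace characters from the start of the string
# (?=\\s) : the match must be followed by white space, which is not captured
m <- regexpr("^\\S+(?=\\s)", dob, perl = TRUE)

# regmatches() returns only the matched substrings
dates <- regmatches(dob, m)
dates
# [1] "9/9/43"   "9/17/88"  "11/21/48"
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than the input if some strings contain no white space.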
Upvotes: 1
Reputation: 1354
Another regex pattern that extracts only the date:
library(stringr)
str_extract(dob, regex("\\d{1,}\\/\\d{1,}\\/\\d{1,}"))
#[1] "9/9/43" "9/17/88" "11/21/48"
\\d{1,}: matches digits at least 1 time
\\/: escapes forward slash
Upvotes: -1
Reputation: 361
The stringr library contains a function tailored to this problem.
library(stringr)
word(dob,1)
# [1] "9/9/43" "9/17/88" "11/21/48"
Upvotes: 17
Reputation: 109844
I often use strsplit for these sorts of problems but liked how simple Romain's answer was. I thought it would be interesting to compare Romain's solution to a strsplit answer.
Here's a strsplit solution:
sapply(strsplit(dob, "\\s+"), "[", 1)
Using the microbenchmark package and dob <- rep(dob, 1000) with the original data:
Unit: milliseconds
expr                                    min       lq        median    uq        max       neval
gsub(" .*$", "", dob)                   4.228843  4.247969  4.258232  4.268029  5.081608  1000
sapply(strsplit(dob, "\\s+"), "[", 1)   14.438241 14.558832 14.634638 14.756628 53.344984 1000
The clear winner on a Win 7 machine is the gsub regex from Romain. Thanks for the answer and explanation, Romain.
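For anyone who wants to rerun the comparison, here is a rough sketch. The helper names `via_gsub`, `via_strsplit` and `dob_big` are made up for this example, and `times = 100` is an arbitrary choice; timings will differ by machine and R version, so no expected numbers are shown.

```r
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")
dob_big <- rep(dob, 1000)  # same enlargement as in the comparison above

# Hypothetical wrappers around the two approaches being compared
via_gsub <- function(x) gsub(" .*$", "", x)
via_strsplit <- function(x) sapply(strsplit(x, "\\s+"), "[", 1)

# Check that both approaches agree before timing them
stopifnot(identical(via_gsub(dob_big), via_strsplit(dob_big)))

# Only run the timing if microbenchmark is installed
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    gsub = via_gsub(dob_big),
    strsplit = via_strsplit(dob_big),
    times = 100
  ))
}
```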
Upvotes: 16
Reputation: 17642
No need for substring, just use gsub:
gsub( " .*$", "", dob )
# [1] "9/9/43" "9/17/88" "11/21/48"
A space ( ), then any character (.) any number of times (*) until the end of the string ($). See ?regex to learn regular expressions.
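A small variation, offered as a sketch rather than a correction: because the pattern already runs to the end of the string, there is at most one match per element, so sub() gives the same result as gsub() here. Swapping the literal space for `\\s` is my own assumption about what you might want if the separator could be a tab; the names `dates_space`, `mixed` and `dates_ws` are just for illustration.

```r
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")

# sub() replaces only the first match, which is all that is needed here
dates_space <- sub(" .*$", "", dob)
dates_space
# [1] "9/9/43"   "9/17/88"  "11/21/48"

# \\s matches any white-space character, so a tab separator is handled too
mixed <- c("9/9/43\t12:00 AM/PM", "9/17/88 12:00 AM/PM")
dates_ws <- sub("\\s.*$", "", mixed)
dates_ws
# [1] "9/9/43"  "9/17/88"
```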
Upvotes: 140