Reputation: 865
I need to extract the non-digit part of a character string. I have a problem with this regex in R (which, according to regexpal, should work):
grep("[\\D]+", "PC 17610", value = TRUE, perl = F)
It should return "PC ", but it returns character(0).
Other test cases:
grep("[\\D]+", "STON/O2. 3101282 ", value = TRUE, perl = F)
# should return "STON/O2."
grep("[\\D]+", "S.C./A.4. 23567", value = TRUE, perl = F)
# should return "S.C./A.4."
grep("[\\D]+", "C.A. 31026", value = TRUE, perl = F)
# should return "C.A."
Update:
The job is to split the column "Ticket" (from the Titanic disaster database) into "TicketNumber" and "TicketSeries" columns. At the moment, Ticket holds values such as "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803". So for the first record the ticket number column should be 21171 and the ticket series column "A/5", and so on for the following records.
For the record "113803", TicketNumber should be "113803" and TicketSeries NA.
Help appreciated, Thanks!
Upvotes: 1
Views: 430
Reputation: 70750
Use sub instead, utilizing the \S regex token to match any non-whitespace characters.
x <- c('PC 17610', 'STON/O2. 3101282 ', 'S.C./A.4. 23567', 'C.A. 31026')
sub('(\\S+).*', '\\1', x)
# [1] "PC" "STON/O2." "S.C./A.4." "C.A."
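As for why the original grep call comes back empty, here is a small sketch of what I believe is going on. grep(value = TRUE) returns the whole matching element rather than the matched substring, and as far as I can tell the default (non-PCRE) engine treats the backslash inside a bracket expression literally, so [\\D] there means "a backslash or a capital D" rather than "a non-digit". The variable names below are only for illustration.

```r
x <- "PC 17610"

# grep(value = TRUE) returns the whole element that matched, not the matched part
whole <- grep("\\D+", x, value = TRUE)
print(whole)            # "PC 17610"

# The pattern from the question with the default engine (perl = FALSE);
# the question reports that this gives character(0)
bracket_default <- grep("[\\D]+", x, value = TRUE, perl = FALSE)
print(bracket_default)

# The same pattern under PCRE, where \D inside brackets is the non-digit class
bracket_perl <- grep("[\\D]+", x, value = TRUE, perl = TRUE)
print(bracket_perl)     # "PC 17610"

# To get only the matched text, locate it with regexpr() and pull it out
# with regmatches()
first_match <- regmatches(x, regexpr("\\D+", x))
print(first_match)      # "PC "
```

So even with a working pattern, grep is the wrong tool for extracting part of a string; sub or regmatches is what you want here.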
Otherwise, if you want to return NA for invalid or empty matches, I suppose you could do ...
x <- c('PC 17610', 'STON/O2. 3101282 ', 'S.C./A.4. 23567', 'C.A. 31026', '31026')
r <- regmatches(x, gregexpr('^\\S+(?=\\s+)', x, perl=TRUE))
r[sapply(r, length)==0] <- NA
unlist(r)
# [1] "PC" "STON/O2." "S.C./A.4." "C.A." NA
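For the two-column split described in the update, something along these lines might work. This is only a minimal base R sketch, assuming every ticket ends with its number and that the series (when present) is everything before the last run of whitespace. The names tickets, has_series and result are made up for the example, and values without a numeric part are not handled.

```r
tickets <- c("A/5 21171", "PC 17599", "STON/O2. 3101282", "113803")

# Drop leading/trailing whitespace so a trailing space is not mistaken
# for a separator
tickets <- trimws(tickets)

# A ticket has a series only if it still contains whitespace after trimming
has_series <- grepl("\\s", tickets)

# Series: everything before the last whitespace-separated token, else NA
series <- ifelse(has_series, sub("\\s+\\S+$", "", tickets), NA_character_)

# Number: whatever follows the last run of whitespace
# (the whole string when there is no whitespace at all)
number <- sub("^.*\\s+", "", tickets)

result <- data.frame(TicketSeries = series,
                     TicketNumber = number,
                     stringsAsFactors = FALSE)
print(result)
```

TicketNumber is kept as character here; wrap it in as.integer() if you need numbers and are sure every value is numeric.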
Upvotes: 3
Reputation: 887971
You can use str_extract from the stringr package:
library(stringr)
x <- c('PC 17610', 'STON/O2. 3101282 ', 'S.C./A.4. 23567',
'C.A. 31026', '31026')
str_extract(x, '\\S+(?=\\s+)')
#[1] "PC" "STON/O2." "S.C./A.4." "C.A." NA
Upvotes: 1