Grep in R matching getting non-digits

Question

I need to extract the non-digit part of a character string. I have a problem with this regex in R (which, according to regexpal, should work):

grep("[\\D]+", "PC 17610", value = TRUE, perl = F)

It should return "PC ", but it returns character(0).

Other test cases:

grep("[\\D]+", "STON/O2. 3101282    ", value = TRUE, perl = F)
# should return "STON/O2."
grep("[\\D]+", "S.C./A.4. 23567", value = TRUE, perl = F)
# should return "S.C./A.4."
grep("[\\D]+", "C.A. 31026", value = TRUE, perl = F)
# should return "C.A."

Update:

The task is to split the "Ticket" column (from the Titanic disaster dataset) into "TicketNumber" and "TicketSeries" columns. Currently, Ticket holds values such as "A/5 21171", "PC 17599", "STON/O2. 3101282", and "113803". For the first record, TicketNumber should be 21171 and TicketSeries "A/5", and so on for the other records.

For the record "113803", TicketNumber should be "113803" and TicketSeries NA.

Help appreciated, Thanks!

hwnd · Accepted Answer

Use sub instead, with the \S regex token, which matches any non-whitespace character.

x <- c('PC 17610', 'STON/O2. 3101282    ', 'S.C./A.4. 23567', 'C.A. 31026')
sub('(\\S+).*', '\\1', x)
# [1] "PC"        "STON/O2."  "S.C./A.4." "C.A."
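
As background on why grep is the wrong tool here: grep selects elements of a vector, and with value = TRUE it returns each matching element whole rather than the part that matched. The sketch below (my own illustration, not part of the original answer) contrasts that with regmatches plus regexpr, which extract only the matched substring. The variable names whole and part are made up for the example, and the backslashes are doubled because they sit inside R string literals.

```r
x <- "PC 17610"

# grep() picks elements; value = TRUE returns the entire matching element.
whole <- grep("\\D+", x, value = TRUE, perl = TRUE)
whole
# [1] "PC 17610"

# regexpr() locates the first match; regmatches() extracts just that substring.
part <- regmatches(x, regexpr("\\D+", x, perl = TRUE))
part
# [1] "PC "
```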

EDIT

Otherwise, if you want to return NA for invalid or empty matches, I suppose you could do ...

x <- c('PC 17610', 'STON/O2. 3101282    ', 'S.C./A.4. 23567', 'C.A. 31026', '31026')
r <- regmatches(x, gregexpr('^\\S+(?=\\s+)', x, perl=T))
unlist({r[sapply(r, length)==0] <- NA; r})
# [1] "PC"        "STON/O2."  "S.C./A.4." "C.A."      NA
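
For the two-column split described in the question's update, one possible approach is sketched below. It is a minimal sketch, not a tested solution for the full Titanic data: the data frame tickets is a hypothetical stand-in built from the sample values in the question, and it assumes the ticket number is always the trailing run of digits. Values with no trailing digits would get NA for TicketNumber.

```r
# Hypothetical data frame holding the sample values from the question.
tickets <- data.frame(
  Ticket = c("A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"),
  stringsAsFactors = FALSE
)

# Strip surrounding whitespace so "$" anchors right after the digits.
ticket <- trimws(tickets$Ticket)

# Number: drop everything up to and including the last non-digit character.
number <- sub("^.*\\D", "", ticket)

# Series: drop the trailing digits, then any whitespace left behind.
series <- trimws(sub("\\d+$", "", ticket))

# Empty strings mean "no such part", so turn them into NA.
tickets$TicketNumber <- ifelse(nzchar(number), number, NA_character_)
tickets$TicketSeries <- ifelse(nzchar(series), series, NA_character_)

tickets
```

With the sample values, TicketSeries comes out as "A/5", "PC", "STON/O2." and NA, and TicketNumber as "21171", "17599", "3101282" and "113803" (still character; wrap in as.integer if numbers are needed).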
