oski86

Reputation: 865

Grep in R: extracting the non-digit part of a string

I need to extract the non-digit part of a character string. I have a problem with this regex in R (which, according to regexpal, should work):

grep("[\\D]+", "PC 17610", value = TRUE, perl = F)

It should return "PC ", but it returns character(0).

Other test cases:

grep("[\\D]+", "STON/O2. 3101282    ", value = TRUE, perl = F)
# should return "STON/O2."
grep("[\\D]+", "S.C./A.4. 23567", value = TRUE, perl = F)
# should return "S.C./A.4."
grep("[\\D]+", "C.A. 31026", value = TRUE, perl = F)
# should return "C.A."

Update:

The job is to split the "Ticket" column (from the Titanic disaster database) into "TicketNumber" and "TicketSeries" columns. Currently, Ticket holds values such as "A/5 21171", "PC 17599", "STON/O2. 3101282" and "113803". For the first record, TicketNumber should be "21171" and TicketSeries "A/5", and likewise for the other records.

For the record "113803", TicketNumber should be "113803" and TicketSeries NA.

Help appreciated, Thanks!

Upvotes: 1

Views: 430

Answers (2)

hwnd

Reputation: 70750

grep only tells you which elements match; it does not extract the matched part. Use sub instead, with the \S regex token, which matches any non-whitespace character.

x <- c('PC 17610', 'STON/O2. 3101282    ', 'S.C./A.4. 23567', 'C.A. 31026')
sub('(\\S+).*', '\\1', x)
# [1] "PC"        "STON/O2."  "S.C./A.4." "C.A."
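
To illustrate why the original call cannot work, here is a minimal sketch. As far as I can tell, two things are going on: with value = TRUE, grep returns the whole matching element rather than the matched substring, and with the default engine \D is not read as "non-digit" inside a bracket expression. The variable names are only for illustration.

```r
x <- "PC 17610"

# grep(value = TRUE) returns whole matching elements, not the matched substring
whole <- grep("\\D+", x, value = TRUE)
whole
# [1] "PC 17610"

# Inside [...] the default engine does not seem to read \D as "non-digit";
# this is the call from the question, which returned character(0)
bracket_default <- grep("[\\D]+", x, value = TRUE, perl = FALSE)

# With perl = TRUE the class works, but the result is still the whole element
bracket_perl <- grep("[\\D]+", x, value = TRUE, perl = TRUE)
bracket_perl
# [1] "PC 17610"
```

So even a corrected pattern would not give "PC " from grep; an extracting or substituting function is needed.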

EDIT

Otherwise, if you want to return NA for invalid or empty matches, I suppose you could do ...

x <- c('PC 17610', 'STON/O2. 3101282    ', 'S.C./A.4. 23567', 'C.A. 31026', '31026')
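# gregexpr() keeps an empty entry for elements without a match
r <- regmatches(x, gregexpr('^\\S+(?=\\s+)', x, perl = TRUE))
# replace the empty entries with NA, then flatten the list
r[lengths(r) == 0] <- NA
unlist(r)
# [1] "PC"        "STON/O2."  "S.C./A.4." "C.A."      NA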
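
For the split described in the question's update, a rough base R sketch could look like the following. It assumes that every ticket ends with its number and that the series, when present, is separated from the number by whitespace. The data frame name titanic is hypothetical; the sample values are the ones from the question.

```r
titanic <- data.frame(
  Ticket = c("A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"),
  stringsAsFactors = FALSE
)

# drop leading/trailing whitespace so the last token really is the number
ticket <- trimws(titanic$Ticket)

# number: whatever follows the last run of whitespace (unchanged if there is none)
titanic$TicketNumber <- sub("^.*\\s+", "", ticket)

# series: everything before the last token, NA when the ticket has no series
titanic$TicketSeries <- ifelse(
  grepl("\\s", ticket),
  sub("\\s+\\S+$", "", ticket),
  NA_character_
)

titanic
```

Tickets that do not fit the assumption (for example, ones without any digits) would need separate handling.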

Upvotes: 3

akrun

Reputation: 887971

You can use str_extract from the stringr package, which returns NA where there is no match:

library(stringr)
str_extract(x, '\\S+(?=\\s+)')
# [1] "PC"        "STON/O2."  "S.C./A.4." "C.A."      NA

data

x <- c('PC 17610', 'STON/O2. 3101282    ', 'S.C./A.4. 23567', 
        'C.A. 31026', '31026')

Upvotes: 1
