Reputation: 518

Regular expression advice

I'm trying to figure out how to extract the names in the character string:

str <- "Bob 1/4 F4 Mary Lou 5/1 Thomas Tank 66/19"

to a vector: "Bob", "Mary Lou", "Thomas Tank"

I have the following, which returns only "Bob". Can anyone tell me how to match globally?

cVec <- ""
findMatch <- regexpr("[^0-9]+", str)
cVec       <- append(cVec, regmatches(str,findMatch))
cVec

Ideally I'd like a list with both the name and fraction elements, e.g. "Bob", "1/4", "Mary Lou", "5/1", "Thomas Tank", "66/19". I suspect that F4 is going to be difficult, but it's not needed. I'd settle for the names!

Cheers.

Upvotes: 1

Answers (5)

Sven Hohenstein

Reputation: 81743

You can extract the names and fractions with the following command:

regmatches(str, gregexpr("[[:alpha:]]+( [[:alpha:]]+)?\\b|\\d+/\\d+", str))
# [[1]]
# [1] "Bob"         "1/4"         "Mary Lou"    "5/1"         "Thomas Tank"
# [6] "66/19"
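
If you want each name next to its fraction rather than one flat vector, here is a minimal sketch of one way to do it. The names `pattern`, `tokens`, `is_fraction` and `pairs` are just illustrative, and it assumes every name in the input is followed by exactly one fraction:

```r
str <- "Bob 1/4 F4 Mary Lou 5/1 Thomas Tank 66/19"

# Same pattern as above: one or two words, or a fraction
pattern <- "[[:alpha:]]+( [[:alpha:]]+)?\\b|\\d+/\\d+"
tokens <- regmatches(str, gregexpr(pattern, str))[[1]]

# Fractions contain a slash; everything else is treated as a name
is_fraction <- grepl("/", tokens, fixed = TRUE)

# Assumption: names and fractions strictly alternate in the input,
# so the two subsets line up element by element
pairs <- data.frame(
  name     = tokens[!is_fraction],
  fraction = tokens[is_fraction],
  stringsAsFactors = FALSE
)
pairs
```

A data frame is only one option; `split(tokens, is_fraction)` would give a plain list of the two groups instead.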

Upvotes: 4

Rex Kerr

Reputation: 167911

I'm not familiar with R's regex syntax, but the following Java regex matches the whole expression (\s means whitespace; \d means a digit, [0-9]; () is a group; R seems to agree):

"([A-Za-z]+\\s)+(\\d+/\\d+(\\s[A-Z]\\d+)?)"

In Java there's a find method that lets you walk through pattern matches. In R, I think it's gregexpr, except this gives you a list of indices, not the strings themselves.
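
For what it's worth, a rough sketch of how that could look in R: `regmatches` turns the positions from `gregexpr` back into strings. This assumes the optional trailing token is one capital letter followed by digits (like F4), and uses `perl = TRUE` so that `\s` and `\d` are handled by the PCRE engine. The names `pattern`, `m` and `records` are only for illustration:

```r
str <- "Bob 1/4 F4 Mary Lou 5/1 Thomas Tank 66/19"

# Assumption: the optional trailing token is a capital letter plus digits (e.g. "F4")
pattern <- "([A-Za-z]+\\s)+\\d+/\\d+(\\s[A-Z]\\d+)?"

# gregexpr() returns the start positions of all matches,
# with the match lengths stored as an attribute
m <- gregexpr(pattern, str, perl = TRUE)

# regmatches() uses those positions to pull out the matched substrings
records <- regmatches(str, m)[[1]]
records
```

Each element of `records` is a whole record (name, fraction and the optional extra token), so it would still need to be split further to get at the names alone.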

Upvotes: 0

The Guy with The Hat

Reputation: 11132

I don't know R, so I can't provide an implementation. However, I think a solution could be built on this regex:

(?<=^| )[a-zA-Z]+(?: [a-zA-Z]+)?(?= |$)|[0-9]+/[0-9]+

It will match Bob, 1/4, Mary Lou, 5/1, Thomas Tank, and 66/19, but not F4.

Online explanation and demonstration here: http://regex101.com/r/vB8rU5
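
A possible R translation, as a sketch only: lookbehind and lookahead need the PCRE engine, so this passes `perl = TRUE` to `gregexpr`. The variable names `pattern` and `matches` are illustrative.

```r
str <- "Bob 1/4 F4 Mary Lou 5/1 Thomas Tank 66/19"

# The pattern contains no backslashes, so it needs no extra escaping in R
pattern <- "(?<=^| )[a-zA-Z]+(?: [a-zA-Z]+)?(?= |$)|[0-9]+/[0-9]+"

# perl = TRUE selects PCRE, which supports the (?<=...) and (?=...) assertions
matches <- regmatches(str, gregexpr(pattern, str, perl = TRUE))[[1]]
matches
```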

Upvotes: 2

Casimir et Hippolyte

Reputation: 89639

You can do it like this:

str <- "Bob 1/4 F4 Mary Lou 5/1 Thomas Tank 66/19"
m <- gregexpr("(?i)\\b[a-z]+(?: [a-z]+)*\\b", str, perl = TRUE)
regmatches(str, m)

Upvotes: 0

Raffael

Reputation: 20045

At the end of the day, this is way too fuzzy to give a solid, general solution. But this would do the trick, and you would just have to trim the names:

> strsplit(str, "[0-9][ 0-9F/]+[0-9]")[[1]]
[1] "Bob "          " Mary Lou "    " Thomas Tank "

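The regular expression describes the separators to split on (here, each fraction together with an optional F4-style token), so what is left over are the names.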
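
For the trimming step, a small sketch using base R's `trimws` (available since R 3.2.0); `pieces` and `names_only` are just illustrative names:

```r
str <- "Bob 1/4 F4 Mary Lou 5/1 Thomas Tank 66/19"

# Split on the numeric chunks, as above
pieces <- strsplit(str, "[0-9][ 0-9F/]+[0-9]")[[1]]

# Strip the leading and trailing spaces left behind by the split
names_only <- trimws(pieces)
names_only
```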

Upvotes: 0
