Thomas Browne

Reputation: 24888

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:

USDZAR Curncy
R157 Govt
SPX Index

In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:

USDZAR
R157
SPX

What's the most efficient way of doing this in R? Is it regular expressions, or must I do something like what I would do in MS Excel using the MID and FIND functions? E.g., in Excel I would write:

=MID(@REF, 1, FIND(" ", @REF, 1)-1)

which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).

Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
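For reference, a literal translation of the MID/FIND formula into base R could look something like the sketch below. It assumes the data sits in a character vector (called x here purely for illustration) and uses regexpr() to find the first space and substr() to cut the string there. Strings with no space at all are returned unchanged, which is an assumption about the desired behaviour.

```r
# Illustrative data; the name x is an assumption, not from the original data set
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")

# Position of the first space in each string (-1 if there is none).
# as.integer() drops the extra attributes that regexpr() attaches.
pos <- as.integer(regexpr(" ", x, fixed = TRUE))

# Keep characters 1 .. pos - 1, like MID(ref, 1, FIND(" ", ref, 1) - 1)
first_word <- ifelse(pos > 0, substr(x, 1, pos - 1), x)
first_word
```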

Upvotes: 11

Views: 11387

Answers (4)

G. Grothendieck

Reputation: 269644

1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:

x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157"   "SPX"  
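As a small sanity check of how that pattern behaves on less tidy input, here is a sketch with made-up strings (not from the question). Because the match starts at the first space and .* runs to the end of the string, everything after the first space is dropped, and strings without any space are left alone.

```r
# Hypothetical edge cases: several words, no space, and a double space
y <- c("SPX Index Extra", "NOSPACE", "R157  Govt")

# sub() replaces only the first match, which begins at the first space
result <- sub(" .*", "", y)
result
## [1] "SPX"     "NOSPACE" "R157"
```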

2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.

read.table(text = x, as.is = TRUE)
##       V1     V2
## 1 USDZAR Curncy
## 2   R157   Govt
## 3    SPX  Index
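If only the identifiers are needed, the first column can be pulled out of that data frame. The sketch below also names the columns via col.names; the names ticker and sec_class are just illustrative choices. It assumes every element has exactly two space-separated fields, as in the sample data.

```r
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")

# Parse each element as one row; as.is = TRUE keeps the columns as character
df <- read.table(text = x, as.is = TRUE,
                 col.names = c("ticker", "sec_class"))

# The identifiers are simply the first column
df$ticker
```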

Upvotes: 23

joran

Reputation: 173577

If you're like me, in that regexps will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:

x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x, " ", fixed = TRUE), "[", 1))

The fixed=TRUE isn't strictly necessary; I'm just pointing out that you can do this (simple case) without really knowing the first thing about regexps.
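A slightly stricter variant of the same idea, offered only as a sketch, swaps unlist(lapply(...)) for vapply(), which checks that each piece really is a single character string:

```r
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")

# Split on a literal space, then take the first piece of each element.
# character(1) tells vapply() to expect exactly one string per element.
first_word <- vapply(strsplit(x, " ", fixed = TRUE), "[", character(1), 1)
first_word
```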

Edited to reflect @Wojciech's comment.

Upvotes: 2

hadley

Reputation: 103898

It's pretty easy with stringr:

x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")

library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]

Upvotes: 4

MRAB

Reputation: 20654

The regex would be to search for:

\x20.*

and replace with an empty string.

If you want to know whether it's faster, just time it.
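A rough way to time it in base R is system.time(). The sketch below compares the regex replacement with a split-based approach on a repeated copy of the sample data; the vector size is arbitrary, and the timings will vary by machine and R version, so none are shown here.

```r
# Build a larger test vector from the sample data (size chosen arbitrarily)
x <- rep(c("USDZAR Curncy", "R157 Govt", "SPX Index"), times = 10000)

# Time the regex replacement
t_sub <- system.time(res_sub <- sub(" .*", "", x))

# Time the split-and-take-first approach
t_split <- system.time(
  res_split <- unlist(lapply(strsplit(x, " ", fixed = TRUE), "[", 1))
)

# Elapsed wall-clock seconds for each approach
print(t_sub["elapsed"])
print(t_split["elapsed"])

# Both approaches should give the same answer on this data
stopifnot(identical(res_sub, res_split))
```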

Upvotes: 1
