user2579376

Reputation: 7

How to extract a part from a string in R

I am having trouble extracting the numeric parts of a string in R. An example of the original strings is "buy 1000 shares of Google at 1100 GBP".

I need to extract the number of shares (1000) and the price (1100) separately. I also need to extract the name of the stock, which always appears after "shares of".

I know that sub and gsub can replace parts of a string, but which functions should I use to extract part of a string?

Upvotes: 1

Views: 500

Answers (4)

bartektartanus

Reputation: 16080

If you want to extract all digits from the text, use this function from the stringi package.

"Nd" is the class of decimal digits.

library(stringi)
stri_extract_all_charclass(c(123,43,"66ala123","kot"),"\\p{Nd}")
[[1]]
[1] "123"

[[2]]
[1] "43"

[[3]]
[1] "66"  "123"

[[4]]
[1] NA

Please note that the numbers 66 and 123 are extracted separately here.
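
Applied to the string from the question, the same idea might look like the sketch below. It uses stri_extract_all_regex rather than the character-class function above, and it assumes that the share count always comes before the price; the variable names are only illustrative.

```r
library(stringi)

s <- "buy 1000 shares of Google at 1100 GBP"

# The result is a list with one element per input string;
# each element holds every run of digits, in order of appearance.
nums <- as.numeric(stri_extract_all_regex(s, "[0-9]+")[[1]])

shares <- nums[1]  # assumption: the first number is the share count
price  <- nums[2]  # assumption: the second number is the price
```

If the order of the numbers is not guaranteed, a pattern that includes the surrounding words ("shares", "GBP") is likely the safer choice.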

Upvotes: 0

G. Grothendieck

Reputation: 269461

1) This extracts all numbers in order:

s <- "buy 1000 shares of Google at 1100 GBP"

library(gsubfn)
strapplyc(s, "[0-9.]+", simplify = as.numeric)

giving:

[1] 1000 1100

2) If the numbers can appear in any order, but the number of shares is always followed by the word "shares" and the price is always followed by GBP, then:

strapplyc(s, "(\\d+) shares", simplify = as.numeric) # 1000
strapplyc(s, "([0-9.]+) GBP", simplify = as.numeric) # 1100

The portion of the string matched by the part of the regular expression within parens is returned.

3) If the string is known to be of the form: X shares of Y at Z GBP then X, Y and Z can be extracted like this:

strapplyc(s, "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = c)

ADDED: Modified the pattern to allow either digits or a dot. Also added (3) above and the following, which operate on a vector of strings:

strapply(c(s, s), "[0-9.]+", as.numeric)
strapply(c(s, s), "[0-9.]+", as.numeric, simplify = rbind) # if each string has the same number of matches

strapply(c(s, s), "(\\d+) shares", as.numeric, simplify = c)
strapply(c(s, s), "([0-9.]+) GBP", as.numeric, simplify = c)

strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP")
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = rbind)
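
For comparison, the capture-group idea from (2) and (3) can also be sketched in base R without gsubfn, using regexec and regmatches. This is not part of gsubfn, and the variable names m, shares, stock and price are only illustrative.

```r
s <- "buy 1000 shares of Google at 1100 GBP"

# regexec() records the full match plus each parenthesised group;
# regmatches() then returns them as one character vector per input string.
m <- regmatches(s, regexec("(\\d+) shares of (.+) at ([0-9.]+) GBP", s))[[1]]

# m[1] is the whole match; the capture groups start at m[2].
shares <- as.numeric(m[2])
stock  <- m[3]
price  <- as.numeric(m[4])
```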

Upvotes: 2

hrbrmstr

Reputation: 78792

I feel compelled to include the obligatory stringr solution as well.

library(stringr)

s <- "buy 1000 shares of Google at 1100 GBP"

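# str_match() returns a matrix: column 1 is the full match,
# column 2 is the first capture group.
str_match(s, "([0-9]+) shares")[, 2]
# [1] "1000"

str_match(s, "([0-9]+) GBP")[, 2]
# [1] "1100"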

Upvotes: 0

Sven Hohenstein

Reputation: 81683

You can use the sub function:

s <- "buy 1000 shares of Google at 1100 GBP"

# the number of shares
sub(".* (\\d+) shares.*", "\\1", s)
# [1] "1000"

# the stock
sub(".*shares of (\\w+) .*", "\\1", s)
# [1] "Google"

# the price
sub(".* at (\\d+) .*", "\\1", s)
# [1] "1100"

You can also use gregexpr and regmatches to extract all substrings at once:

regmatches(s, gregexpr("\\d+(?= shares)|(?<=shares of )\\w+|(?<= at )\\d+", 
                       s, perl = TRUE))
# [[1]]
# [1] "1000"   "Google" "1100"  
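
If there are many such strings, one further base R option is strcapture from the utils package (available in R 3.4.0 and later, as far as I know), which puts the capture groups into typed data frame columns. The following is only a sketch: the second string is made up for illustration, and the pattern assumes whole-number share counts and prices and a single-word stock name.

```r
s <- c("buy 1000 shares of Google at 1100 GBP",
       "buy 250 shares of Apple at 90 GBP")  # second string is hypothetical

# Each capture group becomes one column; proto supplies names and types.
trades <- utils::strcapture(
  "(\\d+) shares of (\\w+) at (\\d+) GBP",
  s,
  proto = data.frame(shares = integer(),
                     stock  = character(),
                     price  = numeric(),
                     stringsAsFactors = FALSE)
)
```

Strings that do not match the pattern should yield a row of NA values rather than an error.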

Upvotes: 1
