ACLAN

Reputation: 401

Extract the first number (with decimals) after a given symbol from a string with multiple numbers in R

I'm trying to get the numbers (including decimals) from a string. My data is similar to this:

V <- c("7.20-<7.35","25-<32","60-<83e","40-<50","0.85-<1.15","80-<98","3.0-<3.4","NA","3.0-<3.4 (110)")

The numbers are mixed with letters and symbols. I need to extract the first number after the < symbol while keeping the positions of the missing values. The desired output looks like this:

desired.output <- c(7.35, 32, 83, 50, 1.15, 98, 3.4, NA, 3.4)

I have tried:

resp <- as.numeric(unlist(regmatches(V,
                 gregexpr("[[:digit:]]+\\.*[[:digit:]]*",V))))
    

and

resp <-  sub(".*<(^[^-])", "\\1", V)

and other patterns in the sub function, but nothing seems to work.

What do you suggest as the best approach?

Upvotes: 2

Views: 813

Answers (3)

The fourth bird

Reputation: 163642

You could also match the first number followed by -< and capture the second number, allowing an optional decimal part in each.

\d+(?:\.\d+)?-<(\d+(?:\.\d+)?).*

The pattern matches:

  • \d+(?:\.\d+)? Match 1+ digits with an optional decimal part
  • -< Match literally
  • ( Capture group 1
    • \d+(?:\.\d+)? Match 1+ digits with an optional decimal part
  • ) Close group 1
  • .* Match the rest of the line


Because the pattern also matches the parts of the string that you don't want in the result, replacing the whole match with group 1 leaves only the captured number.

V <- c("7.20-<7.35","25-<32","60-<83e","40-<50","0.85-<1.15","80-<98","3.0-<3.4","NA","3.0-<3.4 (110)")
sub("\\d+(?:\\.\\d+)?-<(\\d+(?:\\.\\d+)?).*", "\\1", V)

Output

[1] "7.35" "32"   "83"   "50"   "1.15" "98"   "3.4"  "NA"   "3.4" 
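
The result above is still a character vector, while the desired output in the question is numeric. A minimal sketch of the conversion step, assuming every element that does not match the pattern is the literal string "NA" (as in the sample data); the name `extracted` and the use of perl = TRUE are my own choices, not part of the answer:

```r
V <- c("7.20-<7.35","25-<32","60-<83e","40-<50","0.85-<1.15","80-<98","3.0-<3.4","NA","3.0-<3.4 (110)")

# sub() leaves non-matching elements untouched, so "NA" stays the string "NA"
extracted <- sub("\\d+(?:\\.\\d+)?-<(\\d+(?:\\.\\d+)?).*", "\\1", V, perl = TRUE)

# as.numeric() turns the string "NA" into a real NA, keeping its position
resp <- as.numeric(extracted)
resp
```

If the data can contain other non-matching strings, as.numeric() would still return NA for them, but with a coercion warning.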

To match all variations of -, < or > between the two numbers, you can use a character class listing the allowed characters and repeat it 1 or more times:

sub("\\d+(?:\\.\\d+)?[<>-]+(\\d+(?:\\.\\d+)?).*", "\\1", V)
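
As a quick illustration of what the character class buys you, here is a small sketch on made-up inputs (the vector `separators` is hypothetical and not from the question; perl = TRUE is my addition):

```r
# Hypothetical inputs using different separators between the two numbers
separators <- c("25-<32", "25->32", "25-32", "25<32")

# [<>-]+ accepts any run of "<", ">" or "-"; a trailing "-" in a class is literal
second <- sub("\\d+(?:\\.\\d+)?[<>-]+(\\d+(?:\\.\\d+)?).*", "\\1", separators, perl = TRUE)
second
```

All four strings should reduce to "32", since each separator is made up only of characters in the class.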


Upvotes: 0

José

Reputation: 931

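Using str_extract from the stringr package (part of the tidyverse):

library(tidyverse)
V <- c("7.20-<7.35","25-<32","60-<83e","40-<50","0.85-<1.15","80-<98","3.0-<3.4","NA","3.0-<3.4 (110)")
# The lookbehind (?<=<) starts the match right after "<"; \\d+(?:\\.\\d+)? takes
# one or more digits plus an optional decimal part. Elements with no match give NA.
str_extract(V, "(?<=<)\\d+(?:\\.\\d+)?") %>%
  as.numeric()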

[1]  7.35 32.00 83.00 50.00  1.15 98.00  3.40    NA  3.40
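
If you would rather avoid the extra package, the same lookbehind idea can be sketched in base R. This is only one possible variant, not part of the answer above: regexpr() with perl = TRUE finds the first match per element, and pre-filling the result with NA keeps the positions of elements that have no match (the names `m` and `resp_base` are mine):

```r
V <- c("7.20-<7.35","25-<32","60-<83e","40-<50","0.85-<1.15","80-<98","3.0-<3.4","NA","3.0-<3.4 (110)")

# First match per element; -1 marks elements without a match
m <- regexpr("(?<=<)\\d+(?:\\.\\d+)?", V, perl = TRUE)

# Start with all NA, then fill only the positions that matched;
# regmatches() returns one string per matched element, in order
resp_base <- rep(NA_real_, length(V))
resp_base[m != -1] <- as.numeric(regmatches(V, m))
resp_base
```

This also shows why unlist(regmatches(V, gregexpr(...))) loses the alignment with V: unmatched elements are simply dropped, and every number in a string is returned rather than only the one after <.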

Upvotes: 0

Wiktor Stribiżew

Reputation: 627607

You can use

sub(".*<(\\d+(?:\\.\\d+)?).*", "\\1", V, perl=TRUE)
# => [1] "7.35" "32"   "83"   "50"   "1.15" "98"   "3.4"  "NA"   "3.4" 

Replace \\d+(?:\\.\\d+)? with \\d*\\.?\\d+ if you also need to match numbers like .05. Add -? before the first \\d+ if you also need to match negative numbers.

Details:

  • .* - any zero or more chars other than line break chars, as many as possible
  • < - a < char
  • (\d+(?:\.\d+)?) - Group 1 (referred to with \1 from the replacement pattern): one or more digits followed by an optional sequence of a dot and one or more digits
  • .* - any zero or more chars other than line break chars, as many as possible
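
The two variations mentioned above (a leading dot and a minus sign) can be combined. A small sketch on made-up inputs, assuming each string contains a single <; the vector `W` is hypothetical and not from the question:

```r
# Hypothetical inputs: a number without a leading zero, a negative number, a plain integer
W <- c("0.1-<.05", "-5-<-2.5", "10-<12")

# -? allows an optional minus sign; \\d*\\.?\\d+ allows numbers such as .05
out <- as.numeric(sub(".*<(-?\\d*\\.?\\d+).*", "\\1", W, perl = TRUE))
out
```

Note that the leading .* is greedy, so if a string contained several < characters, this pattern would pick the number after the last one that is followed by a number.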

Upvotes: 1
