Reputation: 83

Extract numbers between characters in R

I need to separate the "value" variable in the following dataset into three variables: estimate, low, high. Note that sometimes there are no confidence intervals, so I just have the value.

country gho year    publishstate    value
Afghanistan Raised fasting blood glucose (>=7.0 mmol/L or on medication)(age-standardized estimate) 1980    Published   4.9 [2.5-8.6]
Afghanistan Raised fasting blood glucose (>=7.0 mmol/L or on medication)(age-standardized estimate) 1981    Published   5.1 [2.7-8.5]
Afghanistan Raised fasting blood glucose (>=7.0 mmol/L or on medication)(age-standardized estimate) 1982    Published   5.2 [2.9-8.5]
Afghanistan Raised fasting blood glucose (>=7.0 mmol/L or on medication)(age-standardized estimate) 1983    Published   5.4 [3.1-8.6]

I have tried this:

Data$estimate <- sub("\\[.*","",Data$value)

but it only works for creating the variable estimate. I was thinking of using strsplit, but that does not do the trick either.

Could you help with this?

Thank you very much.

Upvotes: 2

Answers (3)

Allan Cameron

Reputation: 173793

Here's another way to do it using only base R:

lapply(strsplit(Data$value, "[^[:digit:].]"), function(x) as.numeric(x[x != ""]))
# [[1]]
# [1] 4.9 2.5 8.6
#
# [[2]]
# [1] 5.1 2.7 8.5
#
# [[3]]
# [1] 5.2 2.9 8.5
#
# [[4]]
# [1] 5.4 3.1 8.6
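
The list above can be turned into the three requested columns. The following is only a minimal sketch, not part of the original answer: the `value` vector is a hypothetical sample whose last entry has no confidence interval, and the names `parts`, `padded` and `result` are made up for illustration. It relies on the fact that indexing a numeric vector past its end yields NA.

```r
# Hypothetical sample; the last value has no confidence interval.
value <- c("4.9 [2.5-8.6]", "5.1 [2.7-8.5]", "5.2")

# Split on anything that is not a digit or a dot, then drop empty pieces.
parts <- lapply(strsplit(value, "[^[:digit:].]"),
                function(x) as.numeric(x[x != ""]))

# Pad every vector to length 3; positions past the end become NA.
padded <- t(vapply(parts, function(x) x[1:3], numeric(3)))
colnames(padded) <- c("estimate", "low", "high")

result <- as.data.frame(padded)
result
```

With the data from the question, `cbind(Data, result)` would then attach the new columns, assuming `value` is replaced by `Data$value`.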

Upvotes: 0

Sergi Domingo

Reputation: 45

Using tidyr:

library(tidyr)
separate(df, value, c("estimate", "low", "high"), sep = "\\s\\[|-|\\]")

Hope this helps.

Upvotes: 0

G. Grothendieck

Reputation: 269481

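Using the data shown in the Note in reproducible form, we can use separate as shown. The fill = "right" argument causes lower and upper to be filled in with NAs if only one subfield is listed in value. The trailing NA in the vector of column names drops the empty piece that is left after the closing bracket.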

library(dplyr)
library(tidyr)
DF %>%
  separate(value, c("value", "lower", "upper", NA), sep = "[^0-9.]+", fill = "right")
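
To see the fill = "right" behaviour on a row without an interval, here is a small sketch along the same lines. The data frame `DF2` and the result name `out` are hypothetical and not from the question. The column names follow the question (estimate, low, high), and `convert = TRUE` is added on the assumption that numeric columns are wanted rather than character ones.

```r
library(tidyr)

# Hypothetical sample; the last row has an estimate but no interval.
DF2 <- data.frame(year = 1980:1982,
                  value = c("4.9 [2.5-8.6]", "5.1 [2.7-8.5]", "5.2"),
                  stringsAsFactors = FALSE)

# Split on runs of characters that are neither digits nor dots.
# The NA in the names drops the empty piece after the closing bracket,
# fill = "right" pads short rows with NA, and convert = TRUE turns the
# resulting character columns into numeric ones.
out <- separate(DF2, value, c("estimate", "low", "high", NA),
                sep = "[^0-9.]+", fill = "right", convert = TRUE)
out
```

In the last row, low and high should come out as NA while estimate keeps its value.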

Note

Lines <- "country,glucose,year,publishstate,value
Afghanistan,Raised fasting blood glucose (>=7.0 mmol/L or on medication)(age-standardized estimate),1980,Published,4.9 [2.5-8.6]
Afghanistan,Raised fasting blood glucose (>=7.0 mmol/L or on medication)(age-standardized estimate),1981,Published,5.1 [2.7-8.5]
Afghanistan,Raised fasting blood glucose (>=7.0 mmol/L or on medication)(age-standardized estimate),1982,Published,5.2 [2.9-8.5]
Afghanistan,Raised fasting blood glucose (>=7.0 mmol/L or on medication)(age-standardized estimate),1983,Published,5.4 [3.1-8.6]"
DF <- read.csv(text = Lines, header = TRUE, as.is = TRUE)

Upvotes: 5
