ThatGuy
ThatGuy

Reputation: 1233

Extract numbers from strings including '|'

I have data where some of the items are numbers separated by "|", like:

head(mintimes)
[1] "3121|3151" "1171"      "1351|1381" "1050"      ""          "122" 
head(minvalues)
[1] 14    10    11    31 Inf    22

What I would like to do is extract all the times and match them to the minvalues. To end up with something like:

times    values
3121     14
3151     14
1171     10
1351     11
1381     11
1050     31
122      22

I've tried to strsplit(mintimes, "|") and I've tried str_extract(mintimes, "[0-9]+") but they don't seem to work. Any ideas?

Upvotes: 6

Views: 188

Answers (8)

Tyler Rinker
Tyler Rinker

Reputation: 109874

Here's a qdap package approach:

mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122") 
minvalues <- c(14, 10, 11, 31, Inf, 22)

library(qdap)
list2df(setNames(strsplit(mintimes, "\\|"), minvalues), "times", "values")

##   times values
## 1  3121     14
## 2  3151     14
## 3  1171     10
## 4  1351     11
## 5  1381     11
## 6  1050     31
## 7   122     22

Upvotes: 1

gagolews
gagolews

Reputation: 13056

And here's another solution using stri_split_fixed from the stringi package. As an added value, we also play with mapply and do.call.

Input data:

mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122") 
minvalues <- c(14, 10, 11, 31, Inf, 22)

Split mintimes w.r.t. | and convert to numeric:

library("stringi")    
mintimes <- lapply(stri_split_fixed(mintimes, "|"), as.numeric)
## [[1]]
## [1] 3121 3151
## 
## [[2]]
## [1] 1171
## 
## [[3]]
## [1] 1351 1381
## 
## [[4]]
## [1] 1050
## 
## [[5]]
## [1] NA
## 
## [[6]]
## [1] 122

Column-bind each minvalues with corresponding mintimes:

tmp <- mapply(cbind, mintimes, minvalues)
## [[1]]
##      [,1] [,2]
## [1,] 3121   14
## [2,] 3151   14
## 
## [[2]]
##      [,1] [,2]
## [1,] 1171   10
## 
## [[3]]
##      [,1] [,2]
## [1,] 1351   11
## [2,] 1381   11
## 
## [[4]]
##      [,1] [,2]
## [1,] 1050   31
## 
## [[5]]
##      [,1] [,2]
## [1,]   NA  Inf
## 
## [[6]]
##      [,1] [,2]
## [1,]  122   22

Row-bind all the 6 matrices & remove NA-rows:

res <- do.call(rbind, tmp)
res[!is.na(res[,1]),]
##      [,1] [,2]
## [1,] 3121   14
## [2,] 3151   14
## [3,] 1171   10
## [4,] 1351   11
## [5,] 1381   11
## [6,] 1050   31
## [7,]  122   22

Upvotes: 2

Rich Scriven
Rich Scriven

Reputation: 99341

| is a regular expression metacharacter. When used literally, these special characters need to be escaped either with [] or with \\ (or you could use fixed = TRUE in some functions). So your call to strsplit() should be

strsplit(mintimes, "[|]")

or

strsplit(mintimes, "\\|")

or

strsplit(mintimes, "|", fixed = TRUE)

Regarding your other try with stringr functions, str_extract_all() seems to do the trick.

library(stringr)
str_extract_all(mintimes, "[0-9]+")

To get your desired result,

> mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122") 
> minvalues <- c(14, 10, 11, 31, Inf, 22)
> s <- strsplit(mintimes, "[|]")
> data.frame(times = as.numeric(unlist(s)), 
             values = rep(minvalues, sapply(s, length)))
#   times values
# 1  3121     14
# 2  3151     14
# 3  1171     10
# 4  1351     11
# 5  1381     11
# 6  1050     31
# 7   122     22

Upvotes: 6

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193547

I have written a function called cSplit that is useful for these types of things. You can get it from my Gist: https://gist.github.com/mrdwab/11380733

Usage would be:

cSplit(data.table(mintimes, minvalues), "mintimes", "|", "long")
#    mintimes minvalues
# 1:     3121        14
# 2:     3151        14
# 3:     1171        10
# 4:     1351        11
# 5:     1381        11
# 6:     1050        31
# 7:      122        22

It also has a "wide" setting, in case that would be at all useful to you:

cSplit(data.table(mintimes, minvalues), "mintimes", "|", "wide")
#    minvalues mintimes_1 mintimes_2
# 1:        14       3121       3151
# 2:        10       1171         NA
# 3:        11       1351       1381
# 4:        31       1050         NA
# 5:       Inf         NA         NA
# 6:        22        122         NA

Note: The output is a data.table.

Upvotes: 3

rrs
rrs

Reputation: 9913

To get the output you want, try something like this:

library(dplyr)

Split.Times <- function(x) {
  mintimes <- as.numeric(unlist(strsplit(as.character(x$mintimes), "\\|")))
  return(data.frame(mintimes = mintimes, minvalues = x$minvalues, stringsAsFactors=FALSE))
}

df <- data.frame(mintimes, minvalues, stringsAsFactors=FALSE)

df %>%
  filter(mintimes != "") %>%
  group_by(mintimes) %>%
  do(Split.Times(.))

This produces:

  mintimes minvalues
1     1050        31
2     1171        10
3      122        22
4     1351        11
5     1381        11
6     3121        14
7     3151        14

(I borrowed from my answer here - which is pretty much the same question/problem)

Upvotes: 1

jbaums
jbaums

Reputation: 27388

As others have mentioned, you need to escape the | to include it literally in a regular expression. As always, we can skin this cat many ways, and here's one way to do it with stringr:

x <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")

library(stringr)
unlist(str_extract_all(x, "\\d+"))

# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"

This won't work as expected if you have any decimal points in a character string of numbers, so the following (which says to match anything but |) might be safer:

unlist(str_extract_all(x, '[^|]+'))

# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122" 

Either way, you might want to wrap the result in as.numeric.

Upvotes: 2

MrFlick
MrFlick

Reputation: 206308

By default strsplit splits using a regular expression and "|" is a special character in the regular expression syntax. You can either escape it

strsplit(mintimes,"\\|")

or just set fixed=T to not use regular expressions

strsplit(mintimes,"|", fixed=T)

Upvotes: 4

ThatGuy
ThatGuy

Reputation: 1233

You can use [:punct:]

strsplit(mintimes, "[[:punct:]]")

Upvotes: 0

Related Questions