karuno

Reputation: 411

Extract portion of string with punctuation

I have a string:

string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","

and I want to extract separately:

RDS16
Asthma

What I've tried so far is:

extract <- str_extract(string,'~."(.+)')

but I am only able to get:

~ \"Asthma\",

If you have an answer, can you also kindly explain the regex behind it? I'm having a hard time converting string patterns to regex.

Upvotes: 0

Views: 202

Answers (3)

Ronak Shah

Reputation: 389135

You can capture the two values in two separate columns.

With stringr, use str_match:

string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","
stringr::str_match(string, '"(\\w+)" ~ "(\\w+)"')[, -1, drop = FALSE]

#        [,1]    [,2]    
#[1,] "RDS16" "Asthma"

Or, in base R, use strcapture:

strcapture('"(\\w+)" ~ "(\\w+)"', string, 
           proto = list(col1 = character(), col2 = character()))

#   col1   col2
#1 RDS16 Asthma
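
Since the question asks how the regex works, the pattern can be read piece by piece. The sketch below uses base R's regexec and regmatches (chosen here for illustration, not part of the code above) to show the full match next to the two capture groups; the variable names are hypothetical.

```r
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","

# Pattern, read left to right:
#   "        a literal double quote
#   (\\w+)   capture group 1: one or more word characters (letters, digits, _)
#   " ~ "    closing quote, space, tilde, space, opening quote
#   (\\w+)   capture group 2: one or more word characters
#   "        a literal double quote
pattern <- '"(\\w+)" ~ "(\\w+)"'

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into substrings.
parts <- regmatches(string, regexec(pattern, string))[[1]]

full_match <- parts[1]   # the whole matched text, quotes and tilde included
captured   <- parts[-1]  # only the text inside the parentheses
print(captured)
```

By default, str_extract returns only the full match, which is why the attempt in the question still contains the tilde and the quotes; str_match and strcapture return the capture groups instead.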

Upvotes: 0

hello_friend

Reputation: 5788

Base R solutions:

# Input from the question:
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","

# Solution 1: 
# Extract every quoted substring (quotes still attached): 
# dirtyStrings => list with one character vector per input string
dirtyStrings <- regmatches(
  string, 
  gregexpr(
    '".*?"', 
    string
  )
)

# Iterate over the list and "clean" - unquote - each
# element, then flatten: result => character vector
result <- unlist(
  lapply(
    dirtyStrings,
    function(x){
      gsub('"', '', x)
    }
  )
)

# Solution 2: 
# Same as above, less generic -- assumes all strings 
# will follow the same pattern: result => character vector
result <- unlist(
  lapply(
    # Drop everything up to and including "==", then split on "~"
    strsplit(
      gsub(".*==", "", string),
      "~"
    ), 
    # Remove all non-word characters (quotes, spaces, commas) from each piece
    function(x){
      gsub("\\W+", "", x)
    }
  )
)

Upvotes: 0

Calum You

Reputation: 15072

If you just need to extract the sections surrounded by ", you can use something like the following. The pattern ".*?" first matches a ", then .*? matches as few characters as possible, and finally another " closes the match. This gives you the strings including the double quotes; you then just have to remove the quotes to clean up.

Note that str_extract_all is used to return all matches. It returns a list of character vectors (one per input string), so we need to index into the list before removing the double quotes.

library(stringr)
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","

str_extract_all(string, '".*?"') %>%
  `[[`(1) %>%
  str_remove_all('"')
#> [1] "RDS16"  "Asthma"

Created on 2021-06-21 by the reprex package (v1.0.0)
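
To illustrate why the ? after .* matters, here is a minimal sketch comparing the greedy and lazy versions of the pattern. It uses base R's gregexpr and regmatches rather than stringr so that it runs without extra packages; the variable names are hypothetical.

```r
string <- "newdatat.scat == \"RDS16\" ~ \"Asthma\","

# Greedy: .* takes as many characters as possible, so the match runs
# from the first quote to the last quote in the string.
greedy <- regmatches(string, gregexpr('".*"', string))[[1]]

# Lazy: .*? stops at the nearest closing quote, giving one match
# per quoted value.
lazy <- regmatches(string, gregexpr('".*?"', string))[[1]]

print(length(greedy))  # 1
print(length(lazy))    # 2

# Removing the quotes from the lazy matches leaves the two values.
values <- gsub('"', '', lazy, fixed = TRUE)
print(values)
```

With the greedy pattern, the single match also contains the text between the two quoted values, so stripping the quotes afterwards would not separate them.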

Upvotes: 3
