t_goul

Reputation: 23

Removing unwanted parts of strings in a list, and combining the pieces into a single string in R

I am trying to take a list of strings, remove everything except capital letters, and output a list of strings without any spaces or breaks.

I have been trying to use str_extract_all(), but whenever the original string contains anything other than capital letters, it returns the relevant pieces split into separate elements of a character vector (one vector per input string) instead of a single string.

Can anyone please suggest a way to get the desired output?

# Some example data:
a <- list("n[28.0313]MVNNGHSFNVEYDDSQDK[28.0313]AVLK[28.0313]D_+4", 
          "SLGKVGTRC[71.0371]CTK[28.0313]PESER_+4",
          "n[28.0313]AVVQDPALK[28.0313]PLALVY_+3",
          "n[28.0313]TCVADESHAGC[71.0371]EK[28.0313]_+2")

# The desired output:
list("MVNNGHSFNVEYDDSQDKAVLKD", 
          "SLGKVGTRCCTKPESER",
          "AVVQDPALKPLALVY",
          "TCVADESHAGCEK")

# What I've tried so far:
library(dplyr)    # for %>%
library(stringr)  # for str_extract_all()
a %>% str_extract_all("[A-Z]+")

[[1]]
[1] "MVNNGHSFNVEYDDSQDK" "AVLK"               "D"                 
[[2]]
[1] "SLGKVGTRC" "CTK"       "PESER"    
[[3]]
[1] "AVVQDPALK" "PLALVY"   
[[4]]
[1] "TCVADESHAGC" "EK"  

# Not what I want.

I need to find a way to isolate the strings and combine them, but I'm at the limit of my R knowledge.

Upvotes: 0

Views: 27

Answers (1)

akrun

Reputation: 887028

str_extract_all() returns a list with one character vector per input string, so we can loop over that list and paste the pieces of each element together into a single string

library(dplyr)
library(stringr)
library(purrr)
a %>%
      str_extract_all("[A-Z]+") %>%
      map_chr(str_c, collapse="")

-output

[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER"  
[3] "AVVQDPALKPLALVY"         "TCVADESHAGCEK"          
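The same extract-then-collapse idea can be sketched in base R, in case the tidyverse packages are not available. This is only a minimal sketch: the names `pieces` and `joined` are made up here for illustration, and it assumes every element of the list is a single string.

```r
a <- list("n[28.0313]MVNNGHSFNVEYDDSQDK[28.0313]AVLK[28.0313]D_+4",
          "SLGKVGTRC[71.0371]CTK[28.0313]PESER_+4",
          "n[28.0313]AVVQDPALK[28.0313]PLALVY_+3",
          "n[28.0313]TCVADESHAGC[71.0371]EK[28.0313]_+2")

# Flatten the list to a character vector (assumes one string per element)
x <- unlist(a)

# gregexpr() locates every run of capital letters;
# regmatches() extracts them as a list of character vectors
pieces <- regmatches(x, gregexpr("[A-Z]+", x))

# Collapse each vector of pieces into one string
joined <- vapply(pieces, paste, character(1), collapse = "")
joined
```

Here `pieces` plays the role of the str_extract_all() result, and the vapply() call does what map_chr(str_c, collapse = "") does above.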

Or just use gsub to match every character that is not an upper case letter and replace it with an empty string

gsub("[^A-Z]+", "", a)
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER"       "AVVQDPALKPLALVY"         "TCVADESHAGCEK"   

or with str_remove_all

str_remove_all(a, "[^A-Z]+")
[1] "MVNNGHSFNVEYDDSQDKAVLKD" "SLGKVGTRCCTKPESER"       "AVVQDPALKPLALVY"         "TCVADESHAGCEK"   

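The output is a character vector. To get a list with one string per element, as in the desired output, convert it with as.list() (list() would instead return a list of length 1 holding the whole vector)

as.list(str_remove_all(a, "[^A-Z]+"))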
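To check the result against the desired output from the question, the removal step can be put in a small helper and compared with identical(). This is a rough sketch in base R: `keep_caps` is a hypothetical name, not a function from any package, and it assumes each list element is a single string.

```r
# Hypothetical helper: keep only the characters A-Z in each element
# and return a list with one string per input element
keep_caps <- function(x) {
  # unlist() flattens the input; as.list() turns the cleaned
  # character vector back into a list of single strings
  as.list(gsub("[^A-Z]+", "", unlist(x)))
}

a <- list("n[28.0313]MVNNGHSFNVEYDDSQDK[28.0313]AVLK[28.0313]D_+4",
          "SLGKVGTRC[71.0371]CTK[28.0313]PESER_+4",
          "n[28.0313]AVVQDPALK[28.0313]PLALVY_+3",
          "n[28.0313]TCVADESHAGC[71.0371]EK[28.0313]_+2")

# The desired output as given in the question
expected <- list("MVNNGHSFNVEYDDSQDKAVLKD",
                 "SLGKVGTRCCTKPESER",
                 "AVVQDPALKPLALVY",
                 "TCVADESHAGCEK")

result <- keep_caps(a)
identical(result, expected)
```

as.list() is used rather than list() because list() applied to a character vector gives a single list element containing the whole vector.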

Upvotes: 1
