Al14

Reputation: 1814

Use apply to extract parenthesis content from dataframes

I want to extract only the content inside the parentheses. This works well on a single character vector (example below):

j<-"[8] Q(+.98)"
gsub("[\\(\\)]", "", regmatches(j, gregexpr("\\(.*?\\)", j))[[1]])

Now I want to use apply to run the above code on multiple columns of a data frame. Below is what I have tried, which does not work.

a<-c("[7] C(+57.02)",  "[11] C(+57.02)",  NA, NA)
b<- c("[16] C(+57.02)",   NA, NA,NA)
c<-c("[9] Q(+.98)" ,   "[13] Q(+.98)" , "[14] C(+57.02)",NA)
abc<-as.data.frame(rbind(a,b,c))

abc_in<-apply(abc, 2, function(x) 
  gsub("[\\(\\)]", "", regmatches(x, gregexpr("\\(.*?\\)", x))[[1]]))

Upvotes: 0

Views: 57

Answers (3)

Rich Scriven

Reputation: 99361

You don't need apply() or any packages. Since we're operating on the entire data frame, we can first coerce it to a matrix then just use sub().

sub(".*\\((.+)\\).*", "\\1", as.matrix(abc))
#   V1       V2       V3       V4
# a "+57.02" "+57.02" NA       NA
# b "+57.02" NA       NA       NA
# c "+.98"   "+.98"   "+57.02" NA

That gives you a matrix back. If you need to retain the data frame structure, then

abc[] <- sub(".*\\((.+)\\).*", "\\1", as.matrix(abc))

Of course, you could loop over the data frame columns. But for that I would go with lapply() over apply(), since a data frame is a list.

abc[] <- lapply(abc, sub, pattern = ".*\\((.+)\\).*", replacement = "\\1")

Coercion is done implicitly by sub(), so starting with factors is not an issue.
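As a quick check of that last point, here is a minimal sketch that rebuilds the question's data with factor columns and applies the same lapply() call. stringsAsFactors = TRUE is set explicitly because the default depends on the R version, and the name abc_f is only for illustration.

```r
# Data from the question; rbind() makes a, b and c the rows
a <- c("[7] C(+57.02)", "[11] C(+57.02)", NA, NA)
b <- c("[16] C(+57.02)", NA, NA, NA)
c <- c("[9] Q(+.98)", "[13] Q(+.98)", "[14] C(+57.02)", NA)

# abc_f is an illustrative name: same data, but with factor columns
abc_f <- as.data.frame(rbind(a, b, c), stringsAsFactors = TRUE)

# sub() returns character vectors even when it is given factors
abc_f[] <- lapply(abc_f, sub, pattern = ".*\\((.+)\\).*", replacement = "\\1")

# Inspect the column classes after the replacement
sapply(abc_f, class)
```

After the assignment the columns should be plain character vectors, with the NA entries left as NA.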

Upvotes: 2

denis

Reputation: 5673

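Your code does exactly what you tell it to: the [[1]] keeps only the first element of the list returned by regmatches(), so for each column you get just the matches from its first value. I suggest using str_extract() from the stringr package instead; it returns a vector (NA where nothing matches) and is easier to write and use: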

library(stringr)
abs_in <- apply(abc, 2, function(x) gsub("[\\(\\)]", "", str_extract(x, "\\(.*?\\)")))
> abs_in
     V1       V2       V3       V4
[1,] "+57.02" "+57.02" NA       NA
[2,] "+57.02" NA       NA       NA
[3,] "+.98"   "+.98"   "+57.02" NA    
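If you would rather stay in base R, the same idea can be sketched with regexpr() in place of gregexpr(): regexpr() returns one match position per element, so no [[1]] is needed. This is only a sketch; the helper name extract_parens and the result name abc_in are assumptions chosen for illustration.

```r
# Data from the question; rbind() makes a, b and c the rows
a <- c("[7] C(+57.02)", "[11] C(+57.02)", NA, NA)
b <- c("[16] C(+57.02)", NA, NA, NA)
c <- c("[9] Q(+.98)", "[13] Q(+.98)", "[14] C(+57.02)", NA)
abc <- as.data.frame(rbind(a, b, c), stringsAsFactors = FALSE)

# Hypothetical helper: first "(...)" group of every element, NA if none
extract_parens <- function(x) {
  x <- as.character(x)
  out <- rep(NA_character_, length(x))
  m <- regexpr("\\(.*?\\)", x)        # one match position per element
  hit <- !is.na(m) & m > -1L          # elements that actually matched
  # regmatches() returns only the matched substrings, in order
  out[hit] <- gsub("[()]", "", regmatches(x, m))
  out
}

# Apply column by column and keep the data frame structure
abc_in <- abc
abc_in[] <- lapply(abc, extract_parens)
abc_in
```

Using lapply() with abc_in[] keeps the row names and column names of the original data frame, whereas apply() returns a matrix.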

Upvotes: 1

Jan

Reputation: 43179

Another option with stringr. Note that you also need to specify stringsAsFactors = FALSE when building the data frame:

abc<-as.data.frame(rbind(a,b,c), stringsAsFactors = FALSE)

library(stringr)
regex <- "\\(([^()]+)\\)"
str_match_all(abc, regex)

This yields

[[1]]
     [,1]       [,2]    
[1,] "(+57.02)" "+57.02"
[2,] "(+57.02)" "+57.02"
[3,] "(+.98)"   "+.98"  

[[2]]
     [,1]       [,2]    
[1,] "(+57.02)" "+57.02"
[2,] "(+.98)"   "+.98"  

[[3]]
     [,1]       [,2]    
[1,] "(+57.02)" "+57.02"

[[4]]
     [,1]           [,2]        
[1,] "(NA, NA, NA)" "NA, NA, NA"

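Always take the second column of each matrix, which holds the capture group. Note that the whole data frame is passed in here, so each column is collapsed into a single string before matching: the NA entries drop out of the results, and the all-NA fourth column produces a spurious "NA, NA, NA" match that should be discarded.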

Upvotes: 0
