Remove string in parenthesis and add that as a new column

Question

Possible duplicate Here

I have a data frame with two columns. I want to extract the string in parentheses from the second column and add it as a new column. The data frame is shown below.

      structure(list(ID = 1:12, Gene.Name = structure(c(3L, 11L, 9L, 
5L, 1L, 8L, 2L, 4L, 6L, 12L, 10L, 7L), .Label = c(" ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA", 
" heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA", " NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA", 
" ribosomal protein L34 (RPL34), transcript variant 1, mRNA", 
" ribosomal protein S11 (RPS11), mRNA", "ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA", 
"clone MGC:10120 IMAGE:3900723, mRNA, complete cds", "cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA", 
"farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA", "homeobox protein from AL590526 (LOC84528), mRNA", 
"mitochondrial  S33 (MRPS33), transcript variant 1, nuclear gene, mRNA", 
"ribosomal protein S15a (RPS15A), mRNA"), class = "factor")), .Names = c("ID", 
"Gene.Name"), row.names = c(NA, -12L), class = "data.frame")

If no string in parentheses is found, leave that row empty. I have two cases:

1) Get all the strings in parentheses and add them as a new column, separated by ,

2) Get only the last string in parentheses and add it as a new column

I tried something like df$Symbol <- sapply(df, function(x) sub("\\).*", "", sub(".*\\(", "", x))), but it does not give the appropriate output.

Case 1 output

ID  Gene.Name                                                                                    Symbol
1    NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA                       ubiquinone, (9kD, MLRQ),NDUFA4
2   mitochondrial  S33 (MRPS33), transcript variant 1, nuclear gene, mRNA                      MRPS33
3   farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA                                   FDFT1
4    ribosomal protein S11 (RPS11), mRNA                                                       RPS11
5    ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA   oligomycin sensitivity conferring protein,ATP5O
6   cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA                     CMAS
7    heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA                                   HNRPF
8    ribosomal protein L34 (RPL34), transcript variant 1, mRNA                                 RPL34
9   ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA   subunit 9,ATP5G3
10  ribosomal protein S15a (RPS15A), mRNA                                                      RPS15A
11  homeobox protein from AL590526 (LOC84528), mRNA                                            LOC84528
12  clone MGC:10120 IMAGE:3900723, mRNA, complete cds                                          NA

Case 2 output

ID                                                                               Gene.Name   Symbol
1                      NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA   NDUFA4
2                   mitochondrial  S33 (MRPS33), transcript variant 1, nuclear gene, mRNA   MRPS33
3                                farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA   FDFT1
4                                                     ribosomal protein S11 (RPS11), mRNA   RPS11
5  ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA   ATP5O
6                  cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA   CMAS
7                                 heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA   HNRPF
8                               ribosomal protein L34 (RPL34), transcript variant 1, mRNA   RPL34
9 ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA   ATP5G3
10                                                   ribosomal protein S15a (RPS15A), mRNA  RPS15A
11                                         homeobox protein from AL590526 (LOC84528), mRNA  LOC84528
12                                       clone MGC:10120 IMAGE:3900723, mRNA, complete cds

akrun · Accepted Answer

One option is to use sub to extract the text inside the last pair of round brackets in each string.

 Symbol <- sub('.*\\(([^\\)]+)\\)[^\\(]+$', '\\1', df1[,2])
 # strings without any '(' are returned unchanged by sub, so set them to NA
 df1$Symbol <- Symbol[1:nrow(df1)*NA^(!grepl('\\(',df1[,2]))]
 df1$Symbol
 #[1] "NDUFA4"   "MRPS33"   "FDFT1"    "RPS11"    "ATP5O"    "CMAS"    
 #[7] "HNRPF"    "RPL34"    "ATP5G3"   "RPS15A"   "LOC84528" NA
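
The indexing expression in the second line is dense: `NA^FALSE` evaluates to 1 and `NA^TRUE` to NA, so rows without a bracket end up with an NA index and therefore an NA value. A more explicit sketch of the same idea follows. The vector `gene` is a hypothetical stand-in for `df1[,2]`, with values copied from the question, and `has_paren` and `symbol` are names chosen here for illustration.

```r
# Hypothetical sample standing in for df1[, 2]; values copied from the question.
gene <- c(
  " NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA",
  "ribosomal protein S15a (RPS15A), mRNA",
  "clone MGC:10120 IMAGE:3900723, mRNA, complete cds"
)

# TRUE where the string contains at least one opening bracket.
has_paren <- grepl("(", gene, fixed = TRUE)

# Capture the contents of the last "(...)" group. The pattern expects
# at least one non-bracket character after the closing bracket.
symbol <- sub(".*\\(([^)]+)\\)[^(]+$", "\\1", gene)

# sub() returns the input unchanged when nothing matches, so blank those rows.
symbol[!has_paren] <- NA_character_

symbol
# [1] "NDUFA4" "RPS15A" NA
```

Like the original pattern, this assumes some text always follows the last closing bracket (", mRNA" in the sample data); a string ending in ")" would not match.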

Update

For the first case, i.e. extracting all the strings inside round brackets and pasting them together with ", ", one option is rm_round from qdapRegex. With extract=TRUE, rm_round returns a list with one element per input string, so we use sapply to loop through the list. Within each element, grep finds the strings that contain a comma, and those get their round brackets pasted back on. The strings are then pasted together with collapse=', '; toString is a convenient wrapper for this.

 library(qdapRegex)
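 df1$allSymbol <- sapply(rm_round(df1[,2], extract=TRUE), function(x) {
                         # positions of the extracted strings that contain a comma
                         indx <- grep(',', x)
                         # put the round brackets back around those strings
                         x[indx] <- paste0("(", x[indx], ")")
                         toString(x)})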

 # toString(NA) gives the string "NA", so convert it back to a real NA
 is.na(df1$allSymbol) <- df1$allSymbol=='NA'
 df1[c('allSymbol', 'Symbol')]
 #                                          allSymbol   Symbol
 #1                   ubiquinone, (9kD, MLRQ), NDUFA4   NDUFA4
 #2                                            MRPS33   MRPS33
 #3                                             FDFT1    FDFT1
 #4                                             RPS11    RPS11
 #5  oligomycin sensitivity conferring protein, ATP5O    ATP5O
 #6                                              CMAS     CMAS
 #7                                             HNRPF    HNRPF
 #8                                             RPL34    RPL34
 #9                                 subunit 9, ATP5G3   ATP5G3
 #10                                           RPS15A   RPS15A
 #11                                         LOC84528 LOC84528
 #12                                             <NA>     <NA>
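
If installing qdapRegex is not an option, a similar result can be sketched in base R with gregexpr and regmatches. This is an alternative to the rm_round approach above, not part of the original answer. It assumes the brackets are never nested, and `gene`, `groups` and `all_symbol` are hypothetical names, with `gene` standing in for `df1[,2]`.

```r
# Hypothetical sample standing in for df1[, 2]; values copied from the question.
gene <- c(
  " NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA",
  "ribosomal protein S15a (RPS15A), mRNA",
  "clone MGC:10120 IMAGE:3900723, mRNA, complete cds"
)

# All "(...)" groups per string; assumes brackets are never nested.
groups <- regmatches(gene, gregexpr("\\([^()]*\\)", gene))

all_symbol <- vapply(groups, function(g) {
  if (length(g) == 0L) return(NA_character_)   # no brackets in this string
  inner <- substr(g, 2L, nchar(g) - 1L)        # drop the enclosing brackets
  has_comma <- grepl(",", inner, fixed = TRUE)
  # keep the brackets around groups that themselves contain a comma
  inner[has_comma] <- paste0("(", inner[has_comma], ")")
  toString(inner)
}, character(1))

all_symbol
# [1] "ubiquinone, (9kD, MLRQ), NDUFA4" "RPS15A"
# [3] NA
```

Because rows without brackets are returned as NA_character_ directly, the extra step that converts the string "NA" back to a missing value is not needed here.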
