Carol

Reputation: 367

Splitting column of a data.frame in R using gsub

I have a data.frame called rbp that contains a single column, like the following:

 >rbp
          V1
    dd_smadV1_39992_0_1
    Protein: AGBT(Dm)
    Sequence Position
    234
    290
    567
    126
    Protein: ATF1(Dm)
    Sequence Position
    534
    890
    105
    34
    128
    301
    Protein: Pox(Dm)
    201
    875
    453
    *********************
    dd_smadv1_9_02
    Protein: foxc2(Mm)
    Sequence Position
    145
    987
    345
    907
    Protein: Lor(Hs)
    876
    512

I would like to discard the sequence positions and keep only the sequence names and their corresponding protein names, like the following:

dd_smadV1_39992_0_1 AGBT(Dm);ATF1(Dm);Pox(Dm)
dd_smadv1_9_02 foxc2(Mm);Lor(Hs)  

I tried the following code in R but it failed:

library(gsubfn)
Sub(rbp$V1,"Protein:(.*?) ")

Could anyone guide me, please?

Upvotes: 0

Views: 376

Answers (1)

lukeA

Reputation: 54247

Here's one way to do it:

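# Collapse the column into one string, then split it into one chunk per record
x <- strsplit(paste(rbp$V1, collapse = "\n"), "*********************", fixed = TRUE)[[1]]
# Find every "Protein: ..." line in each chunk and keep only the protein name
m <- gregexpr("Protein: (.*?)\n", x)
proteins <- lapply(regmatches(x, m), function(p) sub("Protein: (.*)\n", "\\1", p))
# The record name is the first run of letters, digits or underscores followed by a newline
ids <- sub(".*?([A-Za-z0-9_]+)\n.*", "\\1", x)
sprintf("%s %s", ids, sapply(proteins, paste, collapse = ";"))
# [1] "dd_smadV1_39992_0_1 AGBT(Dm);ATF1(Dm);Pox(Dm)"
# [2] "dd_smadv1_9_02 foxc2(Mm);Lor(Hs)"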
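
If the regular expressions feel hard to follow, the same idea can be sketched without collapsing the column into one string. This is only a rough alternative, assuming that every record starts with its sequence name and that records are separated by a line made up solely of asterisks. The helper name `summarise_record` and the variables `v`, `is_sep` and `record` are made up for illustration, and `rbp` is rebuilt here from the sample in the question so the snippet runs on its own:

```r
# Rebuild the sample column from the question (assumed to be plain strings).
rbp <- data.frame(V1 = c(
  "dd_smadV1_39992_0_1",
  "Protein: AGBT(Dm)", "Sequence Position", "234", "290", "567", "126",
  "Protein: ATF1(Dm)", "Sequence Position", "534", "890", "105", "34", "128", "301",
  "Protein: Pox(Dm)", "201", "875", "453",
  "*********************",
  "dd_smadv1_9_02",
  "Protein: foxc2(Mm)", "Sequence Position", "145", "987", "345", "907",
  "Protein: Lor(Hs)", "876", "512"
), stringsAsFactors = FALSE)

# as.character() also covers the case where V1 was read in as a factor.
v <- trimws(as.character(rbp$V1))

# Assumption: a separator is a line consisting only of asterisks.
is_sep <- grepl("^\\*+$", v)

# Number the records: the counter goes up by one at every separator line.
record <- cumsum(is_sep)[!is_sep]
v <- v[!is_sep]

# Hypothetical helper: the first line of a record is its name,
# and the "Protein: " lines carry the protein names.
summarise_record <- function(lines) {
  proteins <- sub("^Protein:[[:space:]]*", "", grep("^Protein:", lines, value = TRUE))
  paste(lines[1], paste(proteins, collapse = ";"))
}

result <- unname(vapply(split(v, record), summarise_record, character(1)))
result
```

Grouping with `cumsum()` keeps each step inspectable (you can print `record` next to `v` to check the grouping), at the cost of being a bit longer than the regex version.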

Upvotes: 1
