Shaxi Liver

Reputation: 1120

Removing a word from the whole dataset, or ignoring it

I edited my data and it looks like below:

           Sequence       modifications                      No_Ks No_Ks_modif diff
1   AAAAGAAAVANQGKK       Acetyl Acetyl                        2           2    0
2 AAIKFIKFINPKINDGE       Acetyl Biotin Acetyl                 3           3    0
3 AAIKFIKFINPKINDGE       Acetyl Acetyl                        3           2    1
4 IKKVGYNPKTVPFVPIS       Acetyl Acetyl Acetyl Oxidation       3           4   -1

No_Ks is the total number of K residues in the sequence. No_Ks_modif is the number of K residues modified by Acetyl or Biotin. It should count only those two, but it counts Oxidation as well, which is why the number of modified K's can come out higher than the total number of K's (see row 4).

I used the code below to count the number of modified K (from the sequence):

# Count of modifications    
dataset[, No_Ks_modif := 6]
dataset[V6 == "", No_Ks_modif := 5]
dataset[V5 == "", No_Ks_modif := 4]
dataset[V4 == "", No_Ks_modif := 3]
dataset[V3 == "", No_Ks_modif := 2]
dataset[V2 == "", No_Ks_modif := 1]
dataset[V1 == "", No_Ks_modif := 0]

# Retaining Acetyl/Biotin or no modification only
dataset[, AB01 := TRUE]
dataset[, AB02 := TRUE]
dataset[, AB03 := TRUE]
dataset[, AB04 := TRUE]
dataset[, AB05 := TRUE]
dataset[, AB06 := TRUE]

dataset[V1 != "",  AB01 := grepl(V1, pattern = "Acetyl|Biotin|Oxidation")]
dataset[V2 != "",  AB02 := grepl(V2, pattern = "Acetyl|Biotin|Oxidation")]
dataset[V3 != "",  AB03 := grepl(V3, pattern = "Acetyl|Biotin|Oxidation")]
dataset[V4 != "",  AB04 := grepl(V4, pattern = "Acetyl|Biotin|Oxidation")]
dataset[V5 != "",  AB05 := grepl(V5, pattern = "Acetyl|Biotin|Oxidation")]
dataset[V6 != "",  AB06 := grepl(V6, pattern = "Acetyl|Biotin|Oxidation")]


dataset <- dataset[AB01 & AB02 & AB03 & AB04 & AB05 & AB06]

If I remove "Oxidation" from the pattern, any row containing it is dropped entirely, and that's the problem.

I see two ways to do it. One is to count only Biotin and Acetyl as modifications, which my script can't do. The other is to remove "Oxidation" from all of the columns, which I don't know how to do either. Any suggestions are welcome.

One last, minor question: is there a way to paste a large block of code properly formatted, without typing four spaces at the start of every line?

Edit: Before running the code above, the dataset had only two columns:

Sequence                 modifications
AAAAGAAAVANQGKK     [14] Acetyl (K)|[15] Acetyl (K)
AAIKFIKFINPKINDGE   [4] Acetyl (K)|[7] Acetyl (K)

And much more rows.

Upvotes: 0

Views: 56

Answers (1)

mrip

Reputation: 15163

There are certainly easier ways to do this. Here is one example. First I'll reconstruct your dataset the way I think it is:

> df=read.table(text="Sequence                 modifications
+ AAAAGAAAVANQGKK     '[14] Acetyl (K)|[15] Acetyl (K)'
+ AAIKFIKFINPKINDGE   '[4] Acetyl (K)|[7] Acetyl (K)'",h=T,stringsAsFactors = F)
> library(data.table)
> dt<-data.table(df)
> dt
            Sequence                   modifications
1:   AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K)
2: AAIKFIKFINPKINDGE   [4] Acetyl (K)|[7] Acetyl (K)

Now you can use strsplit to do your work:

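# Count the K residues in each sequence; sapply returns a plain integer vector
dt[,no_Ks:=sapply(strsplit(Sequence,""),function(x) sum(x=="K"))]
# Split on spaces and count only the Acetyl/Biotin tokens, so anything else
# (e.g. Oxidation) is ignored
dt[,no_Ks_modif:=sapply(strsplit(modifications," "),
        function(x) sum(x %in% c("Acetyl","Biotin")))]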
dt
##             Sequence                   modifications no_Ks no_Ks_modif
## 1:   AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K)     2           2
## 2: AAIKFIKFINPKINDGE   [4] Acetyl (K)|[7] Acetyl (K)     3           2
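
If some rows also carry Oxidation entries, here is a minimal sketch of both approaches you mention: counting only Acetyl/Biotin, and stripping Oxidation from the modifications string. The second row below is invented for illustration, and I'm assuming Oxidation entries use the same "[position] Name (residue)" format as the rest; the column names `diff` and `modifications_clean` are my own choices.

```r
library(data.table)

# Example rows in the raw two-column format from the question.
# The second row (including its Oxidation entry) is made up for illustration.
dt <- data.table(
  Sequence = c("AAAAGAAAVANQGKK", "IKKVGYNPKTVPFVPIS"),
  modifications = c("[14] Acetyl (K)|[15] Acetyl (K)",
                    "[2] Acetyl (K)|[3] Acetyl (K)|[9] Acetyl (K)|[12] Oxidation (P)")
)

# Approach 1: count only Acetyl/Biotin, ignoring every other modification.
dt[, no_Ks := lengths(regmatches(Sequence, gregexpr("K", Sequence, fixed = TRUE)))]
dt[, no_Ks_modif := lengths(regmatches(modifications,
                                       gregexpr("Acetyl|Biotin", modifications)))]
dt[, diff := no_Ks - no_Ks_modif]

# Approach 2: drop the Oxidation entries from the string itself.
# Split on "|", keep the entries that do not mention Oxidation, and re-join.
dt[, modifications_clean := sapply(
  strsplit(modifications, "|", fixed = TRUE),
  function(x) paste(x[!grepl("Oxidation", x)], collapse = "|")
)]

print(dt)
```

Either approach leaves every row in place; only the count (or the cleaned string) changes, so nothing gets filtered out just because it contains Oxidation.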

Upvotes: 3
