Reputation: 1120
I edited my data, and it now looks like this:
  Sequence          modifications                   No_Ks No_Ks_modif diff
1 AAAAGAAAVANQGKK   Acetyl Acetyl                       2           2    0
2 AAIKFIKFINPKINDGE Acetyl Biotin Acetyl                3           3    0
3 AAIKFIKFINPKINDGE Acetyl Acetyl                       3           2    1
4 IKKVGYNPKTVPFVPIS Acetyl Acetyl Acetyl Oxidation      3           4   -1
No_Ks is the total number of K residues in the sequence. No_Ks_modif is the number of K residues modified by Acetyl or Biotin. It should count only those two, but it also counts Oxidation, which is why the number of modified K's can be higher than the total number of K's.
I used the code below to count the number of modified K (from the sequence):
# Count of modifications
dataset[, No_Ks_modif := 6]
dataset[V6 == "", No_Ks_modif := 5]
dataset[V5 == "", No_Ks_modif := 4]
dataset[V4 == "", No_Ks_modif := 3]
dataset[V3 == "", No_Ks_modif := 2]
dataset[V2 == "", No_Ks_modif := 1]
dataset[V1 == "", No_Ks_modif := 0]
# Retaining Acetyl/Biotin or no modification only
dataset[, AB01 := TRUE]
dataset[, AB02 := TRUE]
dataset[, AB03 := TRUE]
dataset[, AB04 := TRUE]
dataset[, AB05 := TRUE]
dataset[, AB06 := TRUE]
dataset[V1 != "", AB01 := grepl(V1, pattern = "Acetyl|Biotin|Oxidation")]
dataset[V2 != "", AB02 := grepl(V2, pattern = "Acetyl|Biotin|Oxidation")]
dataset[V3 != "", AB03 := grepl(V3, pattern = "Acetyl|Biotin|Oxidation")]
dataset[V4 != "", AB04 := grepl(V4, pattern = "Acetyl|Biotin|Oxidation")]
dataset[V5 != "", AB05 := grepl(V5, pattern = "Acetyl|Biotin|Oxidation")]
dataset[V6 != "", AB06 := grepl(V6, pattern = "Acetyl|Biotin|Oxidation")]
dataset <- dataset[AB01 & AB02 & AB03 & AB04 & AB05 & AB06]
If I remove "Oxidation" from the pattern, any row containing it is dropped entirely instead of being counted, and that's the problem.
I see two ways to do it. One is to count only Biotin and Acetyl as modifications, which my script can't do. The other is to remove "Oxidation" from all of the columns, but I don't know how to do that either. Any suggestions are welcome.
One last, possibly silly question: is there a way to paste a large block of code properly without typing four spaces at the start of every line?
Edit: Before running the code above, the dataset had only two columns:
Sequence modifications
AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K)
AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)
And many more rows.
Upvotes: 0
Views: 56
Reputation: 15163
There are certainly easier ways to do this. Here is one example. First I'll reconstruct your dataset the way I think it is:
> library(data.table)
> df <- read.table(text = "Sequence modifications
+ AAAAGAAAVANQGKK '[14] Acetyl (K)|[15] Acetyl (K)'
+ AAIKFIKFINPKINDGE '[4] Acetyl (K)|[7] Acetyl (K)'", header = TRUE, stringsAsFactors = FALSE)
> dt <- data.table(df)
> dt
Sequence modifications
1: AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K)
2: AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)
Now you can use strsplit to do your work:
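# Count the K residues in each sequence (sapply returns a plain integer vector)
dt[, no_Ks := sapply(strsplit(Sequence, ""), function(x) sum(x == "K"))]
# Count only the Acetyl and Biotin tokens; anything else (e.g. Oxidation) is ignored
dt[, no_Ks_modif := sapply(strsplit(modifications, " "),
                           function(x) sum(x %in% c("Acetyl", "Biotin")))]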
dt
## Sequence modifications no_Ks no_Ks_modif
## 1: AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K) 2 2
## 2: AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K) 3 2
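Your sample does not include an Oxidation row, so here is a rough sketch of how the same idea would handle one. The second row, and in particular the "[12] Oxidation (P)" entry, is made up for illustration; I am assuming Oxidation entries follow the same "[position] Name (residue)" format as the others. The column modifications_clean is also just a name I picked. It covers both options from the question: counting only Acetyl/Biotin, and stripping Oxidation from the string.

```r
library(data.table)

# Hypothetical data: the Oxidation entry and its format are assumptions
dt <- data.table(
  Sequence = c("AAAAGAAAVANQGKK", "IKKVGYNPKTVPFVPIS"),
  modifications = c(
    "[14] Acetyl (K)|[15] Acetyl (K)",
    "[2] Acetyl (K)|[3] Acetyl (K)|[9] Acetyl (K)|[12] Oxidation (P)"
  )
)

# Option 1: count only the modifications of interest
keep <- c("Acetyl", "Biotin")
dt[, No_Ks := sapply(strsplit(Sequence, ""), function(x) sum(x == "K"))]
dt[, No_Ks_modif := sapply(strsplit(modifications, " "),
                           function(x) sum(x %in% keep))]
dt[, diff := No_Ks - No_Ks_modif]

# Option 2: drop every Oxidation entry from the modifications string
# (split on the literal "|", filter, then glue the pieces back together)
dt[, modifications_clean := sapply(
  strsplit(modifications, "|", fixed = TRUE),
  function(x) paste(x[!grepl("Oxidation", x)], collapse = "|")
)]

dt[, .(Sequence, No_Ks, No_Ks_modif, diff)]
```

With this made-up input, both rows should end up with diff equal to 0, because the Oxidation token is never counted. If other modifications turn up later, only the keep vector needs to change.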
Upvotes: 3