Reputation: 49
Let me explain this question with an example. I have three data frames:
df1: a large table that contains all the information.
df1 <- data.frame(Gene=c(1,2,3,4,5,6,7,8),
Description=c("ribonuclease HII", "glycerol-3-phosphate dehydrogenase", "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195", "Arginyl-tRNA synthetase (EC 6.1.1.19)", "PAS domain S-box protein", "ribonuclease HII", "Isoleucyl-tRNA synthetase", "Succinyl-CoA ligase"),
Species=c("aa", "bb","aa","cc","ee","ff","aa","dd"),
Number1= c(1,0,3,20,99,100,31,123),
Number2 =c(1000, 12636,12,455,231,454,123,1), stringsAsFactors = FALSE)
> df1
  Gene                                       Description Species Number1 Number2
1    1                                  ribonuclease HII      aa       1    1000
2    2                glycerol-3-phosphate dehydrogenase      bb       0   12636
3    3 Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195      aa       3      12
4    4             Arginyl-tRNA synthetase (EC 6.1.1.19)      cc      20     455
5    5                          PAS domain S-box protein      ee      99     231
6    6                                  ribonuclease HII      ff     100     454
7    7                         Isoleucyl-tRNA synthetase      aa      31     123
8    8                               Succinyl-CoA ligase      dd     123       1
Then there are df2 and df3, which are subsets of df1 obtained with some grepl and regex filtering:
df2 <- data.frame(Gene=c(1,2,3,4,5,6),
Description=c("ribonuclease HII", "glycerol-3-phosphate dehydrogenase", "glycerol-3-phosphate dehydrogenase", "Arginyl-tRNA synthetase (EC 6.1.1.19)", "PAS domain S-box protein", "glycerol-3-phosphate dehydrogenase"),
Species=c("aa", "bb","aa","cc","ee","ff"),
Number1= c(1,0,3,20,99,100),
Number2 =c(1000, 12636,12,455,231,454), stringsAsFactors = FALSE)
df3 <- data.frame(Gene=c(1,2,3,4,5,6),
Description=c("ribonuclease HII", "nitrite reductase large subunit", "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195", "Cytochrome cd1 nitrite reductase (EC:1.7.2.1)", "PAS domain S-box protein", "nitrite reductase large subunit"),
Species=c("aa", "bb","aa","cc","dd", "ff"),
Number1= c(1,0,3,20,99,100),
Number2 =c(1000, 12636,12,455,231,454), stringsAsFactors = FALSE)
> df2
  Gene                           Description Species Number1 Number2
1    1                      ribonuclease HII      aa       1    1000
2    2    glycerol-3-phosphate dehydrogenase      bb       0   12636
3    3    glycerol-3-phosphate dehydrogenase      aa       3      12
4    4 Arginyl-tRNA synthetase (EC 6.1.1.19)      cc      20     455
5    5              PAS domain S-box protein      ee      99     231
6    6    glycerol-3-phosphate dehydrogenase      ff     100     454
> df3
  Gene                                       Description Species Number1 Number2
1    1                                  ribonuclease HII      aa       1    1000
2    2                   nitrite reductase large subunit      bb       0   12636
3    3 Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195      aa       3      12
4    4     Cytochrome cd1 nitrite reductase (EC:1.7.2.1)      cc      20     455
5    5                          PAS domain S-box protein      dd      99     231
6    6                   nitrite reductase large subunit      ff     100     454
Summary of my question:
I would like to take every species name in df1 and look it up in df2 and df3 under a specific "Description" value. If the species has that description in both data frames, I want to return a table containing all the information for that species, with a new column that says "complete pathway" next to each species meeting this criterion. If it has the description only in df2, the new column should say "incomplete pathway". If the species has it in neither data frame, the code should write "No occurrences" in the new column and proceed to the next species. At the end, I would like a table with the newly produced information.
Here is what I have tried (I selected one description in df2 and one in df3, namely "glycerol-3-phosphate dehydrogenase" and "nitrite reductase large subunit", respectively):
for(i in unique(df1$Species)) {
  x = subset(df2, Species == i & Description == "glycerol-3-phosphate dehydrogenase")
  y = subset(df3, Species == i & Description == "nitrite reductase large subunit")
  if (!is.na(x$Species) & !is.na(y$Species)){
    print(i, "complete pathway")
  }
  else if(!is.na(x$Species) & is.na(y$Species)){
    print(i, "incomplete pathway")
  }
  else if (is.na(x$Species) & is.na(y$Species)){next}
}
However, it throws an error: Error in if (!is.na(x$Species) & !is.na(y$Species)) { : argument is of length zero
The expected output should be a new table (let's say df4):
df4 <- data.frame(Species=c("aa", "bb","cc","ee","ff", "dd"),
New.Table=c("Incomplete p.", "Complete p.","No occurences","No occurences","Incomplete p.", "No occurences"), stringsAsFactors = FALSE)
  Species     New.Table
1      aa Incomplete p.
2      bb   Complete p.
3      cc No occurences
4      ee No occurences
5      ff Incomplete p.
6      dd No occurences
Thanks in advance. I am also open to suggestions for the title and edits to the text.
Upvotes: 0
Views: 30
Reputation: 181
Since you have duplicates, the function all() lets me check whether every description that df1 lists for a species is also present in df2 or df3.
This is a sample of the solution I came up with. Tell me if this is what you expect:
my_species <- unique(df1$Species)
my_data_species <- data.frame(Species = my_species, stringsAsFactors = FALSE)

# Label species number x according to where its df1 descriptions are found
my_function <- function(x) {
  # All descriptions that df1 lists for this species
  descriptions <- df1[df1$Species == my_species[x], "Description"]
  in_df2 <- all(descriptions %in% df2$Description)
  in_df3 <- all(descriptions %in% df3$Description)
  if (in_df2 && in_df3) {
    "complete pathway"
  } else if (in_df2 || in_df3) {
    "incomplete pathway"
  } else {
    "No occurences"
  }
}

# vapply() returns one label per species, so no global assignment (<<-) is needed
my_data_species[["New Table"]] <- vapply(seq_along(my_species), my_function, character(1))
my_data_species
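A possible alternative that stays closer to the rule stated in the question: the "argument is of length zero" error arises because subset() returns a zero-row data frame when nothing matches, so is.na(x$Species) is logical(0) and if() has nothing to test. The sketch below checks membership with %in% instead of testing the subset. The helper classify_pathway() and its argument names are hypothetical, the sample data is trimmed to the Species and Description columns from the question, and labelling a match found only in df3 as "No occurrences" is an assumption, since the question does not cover that case.

```r
# Species and Description columns copied from the question's sample data
df1 <- data.frame(Species = c("aa", "bb", "aa", "cc", "ee", "ff", "aa", "dd"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  Species = c("aa", "bb", "aa", "cc", "ee", "ff"),
  Description = c("ribonuclease HII",
                  "glycerol-3-phosphate dehydrogenase",
                  "glycerol-3-phosphate dehydrogenase",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19)",
                  "PAS domain S-box protein",
                  "glycerol-3-phosphate dehydrogenase"),
  stringsAsFactors = FALSE)
df3 <- data.frame(
  Species = c("aa", "bb", "aa", "cc", "dd", "ff"),
  Description = c("ribonuclease HII",
                  "nitrite reductase large subunit",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195",
                  "Cytochrome cd1 nitrite reductase (EC:1.7.2.1)",
                  "PAS domain S-box protein",
                  "nitrite reductase large subunit"),
  stringsAsFactors = FALSE)

# Why the loop in the question fails: df2 has no row for species "dd",
# so the subset has zero rows and the if() condition has length zero.
empty <- subset(df2, Species == "dd")
length(is.na(empty$Species))  # 0

# Hypothetical helper: label each species by whether it carries desc2 in df2
# and desc3 in df3. Assumption: a match in df3 only counts as "No occurrences".
classify_pathway <- function(species, df2, df3, desc2, desc3) {
  in_df2 <- species %in% df2$Species[df2$Description == desc2]
  in_df3 <- species %in% df3$Species[df3$Description == desc3]
  ifelse(in_df2 & in_df3, "complete pathway",
         ifelse(in_df2, "incomplete pathway", "No occurrences"))
}

df4 <- data.frame(Species = unique(df1$Species), stringsAsFactors = FALSE)
df4$New.Table <- classify_pathway(df4$Species, df2, df3,
                                  "glycerol-3-phosphate dehydrogenase",
                                  "nitrite reductase large subunit")
df4
```

On the sample data this labels aa as incomplete, bb and ff as complete, and cc, ee and dd as "No occurrences". Note that ff carries the chosen description in both df2 and df3, so under the stated rule it comes out as complete, which differs from the expected table in the question.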
Upvotes: 0