Reputation: 19
I am attaching an example with my attempts because I cannot manage to rearrange my data in R. I have a data frame whose first column is the taxonomic lineage of microorganisms. Every other column is a DNA sequence recoded as ASV1, ASV2, and so on.
In each column, only some of the values are 1; the rest are 0.
The code below makes the example reproducible. The RData file containing the data frame is freely available at: https://www.jottacloud.com/s/191545e30dc99e14823959fadba6d189be5
library(readxl)  # provides read_xlsx()
data <- read_xlsx("combined_allranks_mpa.xlsx")
datastackoverchange <- as.data.frame(data)
# Rename the 3811 sequence columns to ASV_1 ... ASV_3811
names(datastackoverchange)[2:3812] <- sprintf("ASV_%d", seq_len(3811))
save.image("stackoverflow_data.RData")
# Subset the first two columns (lineage and ASV_1)
data1 <- datastackoverchange[, c(1, 2)]
# Each column is mostly zeros, except for the rows of the lineage it belongs to.
# Drop the rows with zeros, which are not of interest:
data1[data1 == 0] <- NA
data1 <- data1[complete.cases(data1), ]
This gives the following table (see the linked image):
[The ASV1 column has 4 rows with the value "1" because each "1" corresponds to a specific lineage rank](https://i.sstatic.net/OZi9W.jpg)
In this first example (subset c(1,2)), the most complete (longest) lineage for ASV1 is k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales. Usually, the longest lineage for an ASV appears last in the data frame.
From here, I would like to build, perhaps starting from an empty data frame or list, something like this:
Column A | Column B |
---|---|
ASV1 | k__Bacteria/p__Firmicutes/c__Clostridia/o__Clostridiales |
ASV2 | and so on |
The "/" separators shown above are "|" in the data frame.
I would like to do this for every column (ASV2, ASV3, ...), probably with a loop that iterates over them,
so that I can use the data (I have 3811 different ASVs) for further analysis.
Thanks in advance for any hints on how to solve this.
Upvotes: 0
Views: 79
Reputation: 77
Following up on my question, and with the help of another Stack Overflow question (Extract Row and Column Name if the value for the cell in the data frame is greater than 0 and save value and row and column name to empty data frame), I have made some progress.
Here is the code, starting from the RData file shared in my earlier post on this page:
load("stackoverflow_data.RData")
datastackoverchange <- as.data.frame(datastackoverchange)
library(tidyverse)
# Move the lineage column into the row names
dat_clean_def <- datastackoverchange %>%
  remove_rownames() %>%
  column_to_rownames(var = "Classification")
# Row/column indices of every cell equal to 1
idx <- which(dat_clean_def == "1", arr.ind = TRUE)
results <- data.frame(Row = rownames(dat_clean_def)[idx[, 1]],
                      Col = colnames(dat_clean_def)[idx[, 2]],
                      Val = dat_clean_def[idx])
results
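To make the indexing step concrete, here is a minimal sketch on a made-up table. The object `toy` and its values are invented for illustration and are not taken from the real data; it only assumes the same layout as `dat_clean_def` (lineages as row names, one column per ASV).

```r
# Toy stand-in for dat_clean_def (hypothetical values)
toy <- data.frame(
  ASV_1 = c(1, 1, 0),
  ASV_2 = c(1, 0, 1),
  row.names = c("k__Bacteria",
                "k__Bacteria|p__Firmicutes",
                "k__Bacteria|p__Proteobacteria")
)

# which(..., arr.ind = TRUE) returns one (row, col) index pair per cell equal to 1
idx <- which(toy == 1, arr.ind = TRUE)

# Translate the index pairs back into lineage names and ASV names
toy_results <- data.frame(
  Row = rownames(toy)[idx[, "row"]],
  Col = colnames(toy)[idx[, "col"]],
  Val = toy[idx]
)
print(toy_results)
```

Each ASV ends up with one row per lineage rank it matches, which is why a further filtering step is needed.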
However, I need to keep only the longest lineage for each ASV. For example, given:
Row | Column | Value |
---|---|---|
k__Bacteria | ASV_1 | 1 |
k__Bacteria;p__Firmicutes | ASV_1 | 1 |
k__Bacteria;p__Firmicutes;c__Clostridia | ASV_1 | 1 |
k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales | ASV_1 | 1 |
k__Bacteria | ASV_2 | 1 |
So I am looking for a function that, for each distinct value of Column, keeps the row with the longest value in Row (the one with the most "_" characters).
Perhaps with stringr?
Thanks again.
Upvotes: 0
Reputation: 198
You can either replace the function in the first line of my first answer with:
function(x) {
  # Lineages flagged with 1 for this ASV column
  data_1 <- datastackoverchange$Classification[which(x == 1)]
  # Index of the deepest lineage (the one with the most "_")
  id_max <- which.max(str_count(data_1, "_"))
  return(data_1[id_max])
}
OR
Continuing from the code you wrote, you can try this:
library(stringr)
results %>% group_by(Col) %>% filter(Row == Row[which.max(str_count(Row,"_"))])
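As a rough check of the grouped filter, here is a small self-contained sketch. `toy_results` is a made-up stand-in for `results`, and it counts the "|" separators rather than "_", which is a variation of mine that assumes the lineages in the real data are separated by "|" as described in the question.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for `results` (values invented for illustration)
toy_results <- data.frame(
  Row = c("k__Bacteria",
          "k__Bacteria|p__Firmicutes",
          "k__Bacteria|p__Firmicutes|c__Clostridia",
          "k__Bacteria"),
  Col = c("ASV_1", "ASV_1", "ASV_1", "ASV_2"),
  Val = 1
)

longest <- toy_results %>%
  group_by(Col) %>%
  # fixed("|") matches a literal pipe, so the count is the number of rank separators
  filter(Row == Row[which.max(str_count(Row, fixed("|")))]) %>%
  ungroup()

print(longest)
```

Within one ASV each deeper lineage contains its parent as a prefix, so counting "_" as in the line above should select the same row; counting "|" just ties the result directly to the number of ranks.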
Upvotes: 1
Reputation: 198
Try this (it takes the last row with a 1 in each column, so it relies on the longest lineage appearing last):
values <- apply(datastackoverchange[, 2:ncol(datastackoverchange)], 2,
                FUN = function(x) datastackoverchange$Classification[which(x == 1) %>% dplyr::last()])
id <- colnames(datastackoverchange[, 2:ncol(datastackoverchange)])
df <- data.frame(id, values)
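For reference, here is the same idea on a tiny made-up data frame, written in base R only. The `toy` data, the use of `vapply`, and the `NA` guard for columns with no 1 at all are assumptions added for illustration; like the code above, it depends on the longest lineage being the last matching row.

```r
# Hypothetical miniature of datastackoverchange (values invented)
toy <- data.frame(
  Classification = c("k__Bacteria",
                     "k__Bacteria|p__Firmicutes",
                     "k__Bacteria|p__Proteobacteria"),
  ASV_1 = c(1, 1, 0),
  ASV_2 = c(1, 0, 1)
)

asv_cols <- toy[, -1, drop = FALSE]

# For each ASV column, take the Classification of the last row equal to 1
values <- vapply(
  asv_cols,
  function(x) {
    hits <- which(x == 1)
    if (length(hits) == 0) NA_character_ else toy$Classification[hits[length(hits)]]
  },
  character(1)
)

df <- data.frame(id = names(values), values = unname(values))
print(df)
```

If the row order cannot be trusted, the `str_count()` variant from my other answer may be the safer choice.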
Upvotes: 1