David

Reputation: 19

R data arrangement for metagenomic data

I am attaching an example of my attempts because I have not managed to arrange my data with R code. I have a data frame whose first column is the taxonomic lineage of the microorganisms. Each remaining column is a DNA sequence, recoded as ASV1, ASV2, and so on.

In each column, only some of the values are 1; the rest are 0.

The code below is reproducible. The RData file for loading the data frame is freely available at: https://www.jottacloud.com/s/191545e30dc99e14823959fadba6d189be5


library(readxl)

data <- read_xlsx("combined_allranks_mpa.xlsx")

datastackoverchange <- as.data.frame(data)

# Rename the 3811 sequence columns to ASV_1 ... ASV_3811
names(datastackoverchange)[2:3812] <- sprintf("ASV_%d", seq_len(3811))

save.image("stackoverflow_data.RData")

# Subset the first two columns

data1 <- datastackoverchange[, c(1, 2)]

# Each column is mostly zeros, except in the rows of the lineage it corresponds to.

# Remove the zeros, which are not of interest:
data1[data1 == 0] <- NA
data1 <- data1[complete.cases(data1), ]

This gives the following table (see the linked image):

[Column ASV1 has 4 rows with the value "1" because each "1" corresponds to a specific lineage rank](https://i.sstatic.net/OZi9W.jpg)

In this first example (subset c(1,2)), the most complete (longest) lineage for ASV1 is k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales. Usually, the longest lineage for an ASV appears in the last position in the data frame.

From this point, I would like to build, perhaps starting from an empty data frame or list, a table like this:

Column A Column B
ASV1 k__Bacteria/p__Firmicutes/c__Clostridia/o__Clostridiales
ASV2 and so on

The "/" characters are "|" in the data frame.

I would like the same for every column (ASV2, ASV3, ...), using a loop to iterate over them,

so that I can use the data (I have 3811 different ASVs) in further analysis.

Thanks in advance for any hints on how to solve this.

Upvotes: 0

Views: 79

Answers (3)

MagíBC

Reputation: 77

Following up on my question, and with help from another Stack Overflow question (Extract Row and Column Name if the value for the cell in the data frame is greater than 0 and save value and row and column name to empty data frame), I have made some progress.

Here is the code, using the RData file linked in my earlier post on this page:

load("stackoverflow_data.RData")
datastackoverchange <- as.data.frame(datastackoverchange)

library(tidyverse)

# Move the lineage column into the row names so only the ASV columns remain
dat_clean_def <- datastackoverchange %>%
  remove_rownames() %>%
  column_to_rownames(var = "Classification")

# Row/column positions of every cell equal to 1
idx <- which(dat_clean_def == "1", arr.ind = TRUE)
results <- data.frame(Row = rownames(dat_clean_def)[idx[, 1]],
                      Col = colnames(dat_clean_def)[idx[, 2]],
                      Val = dat_clean_def[idx])
results
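A minimal sketch of what `which(..., arr.ind = TRUE)` returns, on a made-up 3 x 2 matrix rather than the real data. The object names `m`, `idx_toy` and `hits` are only illustrative.

```r
# Toy 0/1 matrix (invented values): rows are lineages, columns are ASVs.
m <- matrix(c(1, 1,
              1, 0,
              0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("k__Bacteria",
                              "k__Bacteria|p__Firmicutes",
                              "k__Bacteria|p__Proteobacteria"),
                            c("ASV_1", "ASV_2")))

# One (row, col) index pair per cell equal to 1, in column-major order.
idx_toy <- which(m == 1, arr.ind = TRUE)

# Translate the index pairs back into lineage and ASV names.
hits <- data.frame(Row = rownames(m)[idx_toy[, "row"]],
                   Col = colnames(m)[idx_toy[, "col"]],
                   stringsAsFactors = FALSE)
```

Each lineage flagged for an ASV becomes one row of `hits`, so an ASV with several ranks set to 1 appears several times.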

However, I need to keep only the longest lineage for each column. At the moment the output looks like this, e.g.:

Row Column Value
k__Bacteria ASV_1 1
k__Bacteria;p__Firmicutes ASV_1 1
k__Bacteria;p__Firmicutes;c__Clostridia ASV_1 1
k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales ASV_1 1
k__Bacteria ASV_2 1

So I am looking for a function that, for each distinct value of the Col column, selects the longest value of the Row column (the one with the highest number of "_").

Perhaps using stringr?

Thanks again.

Upvotes: 0

Camillionnaire

Reputation: 198

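You can either replace the function in the first line of my first answer with this one (it needs library(stringr) for str_count):

function(x) {
  # Lineages flagged with 1 for this ASV column
  data_1 <- datastackoverchange$Classification[which(x == 1)]
  # Position of the lineage with the most "_" characters, i.e. the deepest rank
  id_max <- which.max(str_count(data_1, "_"))
  return(data_1[id_max])
}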

Or, continuing from the code you wrote, you can try this:

library(dplyr)
library(stringr)
# Within each ASV, keep the row whose lineage has the most "_" characters
results %>% group_by(Col) %>% filter(Row == Row[which.max(str_count(Row, "_"))])
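As a cross-check, here is a base R sketch of the same "keep the deepest lineage per ASV" idea on made-up rows shaped like the `results` table. It counts the "|" rank separators instead of underscores, which is my substitution and not part of the answer above; `results_toy`, `depth` and `deepest` are illustrative names.

```r
# Toy stand-in for `results` (invented values).
results_toy <- data.frame(
  Row = c("k__Bacteria",
          "k__Bacteria|p__Firmicutes",
          "k__Bacteria|p__Firmicutes|c__Clostridia",
          "k__Bacteria",
          "k__Bacteria|p__Proteobacteria"),
  Col = c("ASV_1", "ASV_1", "ASV_1", "ASV_2", "ASV_2"),
  Val = 1,
  stringsAsFactors = FALSE
)

# Depth of a lineage = number of ranks separated by a literal "|".
results_toy$depth <- lengths(strsplit(results_toy$Row, "|", fixed = TRUE))

# Split by ASV and keep the single deepest row of each group.
by_asv <- split(results_toy, results_toy$Col)
deepest <- do.call(rbind, lapply(by_asv, function(g) g[which.max(g$depth), ]))
```

`deepest` ends up with one row per ASV. Counting separators avoids depending on how many underscores a taxon name happens to contain, although for strictly nested lineages both counts pick the same row.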

Upvotes: 1

Camillionnaire

Reputation: 198

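Try this:

library(dplyr)  # provides %>% and last()

# For each ASV column, take the lineage in the last row flagged with 1
values <- apply(datastackoverchange[, 2:ncol(datastackoverchange)], 2,
                FUN = function(x) datastackoverchange$Classification[which(x == 1) %>% dplyr::last()])

# ASV column names
id <- colnames(datastackoverchange[, 2:ncol(datastackoverchange)])

df <- data.frame(id, values)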
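The approach above relies on the deepest lineage being the last flagged row. A hedged base R variant that does not depend on row order could look like the sketch below, run on a made-up table. The names `toy`, `depth`, `deepest_lineage` and `df_toy` are illustrative, and returning NA for an all-zero column is an assumption about how such columns should be handled.

```r
# Toy wide table (invented values): one lineage column, one 0/1 column per ASV.
toy <- data.frame(
  Classification = c("k__Bacteria",
                     "k__Bacteria|p__Firmicutes",
                     "k__Bacteria|p__Firmicutes|c__Clostridia",
                     "k__Bacteria|p__Proteobacteria"),
  ASV_1 = c(1, 1, 1, 0),
  ASV_2 = c(1, 0, 0, 1),
  ASV_3 = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Number of ranks in each lineage (split on a literal "|").
depth <- lengths(strsplit(toy$Classification, "|", fixed = TRUE))

# For each ASV column, pick the flagged lineage with the greatest depth.
deepest_lineage <- vapply(toy[-1], function(x) {
  hits <- which(x == 1)
  if (length(hits) == 0) return(NA_character_)  # assumption: no hit gives NA
  toy$Classification[hits[which.max(depth[hits])]]
}, character(1))

df_toy <- data.frame(id = names(deepest_lineage),
                     values = unname(deepest_lineage),
                     stringsAsFactors = FALSE)
```

`df_toy` has one row per ASV column, with the ASV name in `id` and its deepest lineage in `values`, which matches the two-column layout requested in the question.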

Upvotes: 1
