Reputation: 85
There are many posts about this issue, but I could not find a solution. I have 140 individual files. Each file has the same columns, as follows.
taxy taxx
1 Alistipes Roseburia
2 Alistipes Clostridium
3 Alistipes Clostridium
4 Dorea Alistipes
5 Clostridium Alistipes
6 Roseburia Dorea
I need to create two new columns (otuno) based on taxy and taxx respectively. For example, taxy contains more than 70 distinct values, so it is impractical to assign a number to each value by hand. My desired output looks like this:
taxy taxx taxyid taxxid
1 Alistipes Roseburia 1 2
2 Alistipes Clostridium 1 3
3 Dorea Alistipes 4 1
4 Clostridium Alistipes 3 1
5 Roseburia Dorea 2 4
How do I do this for all 140 files at once? All the files are in CSV format.
Upvotes: 0
Views: 36
Reputation: 389055
We can unlist the data frame, gather all the unique values, use them as the factor levels for each column, and convert the result to integer.
unique_levels <- unique(unlist(df))
df[paste0(names(df), "_id")] <- lapply(df, function(x)
  as.integer(factor(x, levels = unique_levels)))
df
# taxy taxx taxy_id taxx_id
#1 Alistipes Roseburia 1 4
#2 Alistipes Clostridium 1 3
#3 Alistipes Clostridium 1 3
#4 Dorea Alistipes 2 1
#5 Clostridium Alistipes 3 1
#6 Roseburia Dorea 4 2
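As a cross-check on the idea above, the same IDs can be obtained with match() against the shared vector of unique values, since an integer-converted factor is just the position of each value in its levels. This is only a minimal sketch: the data frame is rebuilt here with character columns (an assumption, so the block runs on its own), and the use of match() is a substitute for the factor() call, not part of the original answer.

```r
# Sample data rebuilt with character columns (assumption: same values as the example)
df <- data.frame(
  taxy = c("Alistipes", "Alistipes", "Alistipes", "Dorea", "Clostridium", "Roseburia"),
  taxx = c("Roseburia", "Clostridium", "Clostridium", "Alistipes", "Alistipes", "Dorea"),
  stringsAsFactors = FALSE
)

# Pool the values of both columns and keep them in order of first appearance
unique_levels <- unique(unlist(df, use.names = FALSE))

# match() returns the position of each value in unique_levels, i.e. its ID
ids <- lapply(df, match, table = unique_levels)
df[paste0(names(ids), "_id")] <- ids
df
```

Because both columns are looked up in the same vector, a taxon gets the same number whether it appears in taxy or in taxx.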
In dplyr, we can use mutate_all:
library(dplyr)
df %>% mutate_all(list(id = ~as.integer(factor(., levels = unique_levels))))
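In dplyr 1.0.0 and later, mutate_all is superseded by across(). A rough equivalent of the call above, assuming a recent dplyr is installed, might look like the following. The sample data frame is rebuilt with character columns so the block is self-contained, and the name res is only illustrative.

```r
library(dplyr)

# Sample data rebuilt with character columns (assumption: same values as the example)
df <- data.frame(
  taxy = c("Alistipes", "Alistipes", "Alistipes", "Dorea", "Clostridium", "Roseburia"),
  taxx = c("Roseburia", "Clostridium", "Clostridium", "Alistipes", "Alistipes", "Dorea"),
  stringsAsFactors = FALSE
)

# Shared levels pooled from both columns, in order of first appearance
unique_levels <- unique(unlist(df, use.names = FALSE))

# across() applies the same encoding to every column; .names appends "_id"
res <- df %>%
  mutate(across(everything(),
                ~ as.integer(factor(.x, levels = unique_levels)),
                .names = "{.col}_id"))
res
```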
To apply this to multiple files, we can put the above code in a function
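# Full paths of the input files
all_files <- list.files("path/of/files", full.names = TRUE)
# Columns that should receive an "_id" companion column
cols <- c("taxx", "taxy")
apply_fun <- function(df) {
  # Levels are pooled from every column of df, in order of first appearance
  unique_levels <- unique(unlist(df))
  df[paste0(cols, "_id")] <- lapply(df[cols], function(x) as.integer(factor(x, levels = unique_levels)))
  return(df)
}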
and apply the function to each file through lapply, reading the files with read.csv or whichever method you used.
# Each result is written to the working directory under the original file name
lapply(seq_along(all_files), function(x)
  write.csv(apply_fun(read.csv(all_files[x])), basename(all_files[x]), row.names = FALSE))
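Note that the loop above numbers the values separately within each file, so the same taxon may receive a different ID in different files. If the IDs need to agree across all files, one option is to pool the values of every file first and encode against that shared vector. The following is only a sketch under that assumption: encode_files, its arguments, and the "encoded" output directory are hypothetical names, not part of the original answer.

```r
# Hypothetical helper: build one shared lookup from all files, then encode each file
encode_files <- function(files, cols = c("taxy", "taxx"), out_dir = "encoded") {
  # Read every file once; strings stay as character vectors
  tables <- lapply(files, read.csv, stringsAsFactors = FALSE)

  # Pool the taxon columns of all files, keeping order of first appearance
  shared_levels <- unique(unlist(
    lapply(tables, function(d) unlist(d[cols], use.names = FALSE))
  ))

  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (i in seq_along(files)) {
    d <- tables[[i]]
    # The ID is the position of each value in the shared lookup
    d[paste0(cols, "_id")] <- lapply(d[cols], match, table = shared_levels)
    write.csv(d, file.path(out_dir, basename(files[i])), row.names = FALSE)
  }

  # Return the lookup invisibly so the ID-to-name mapping can be kept
  invisible(shared_levels)
}
```

A call such as encode_files(all_files) would then write the encoded copies to a separate directory instead of reusing the original file names in the working directory, which avoids overwriting the inputs.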
data
df <- structure(list(taxy = structure(c(1L, 1L, 1L, 3L, 2L, 4L),
.Label = c("Alistipes", "Clostridium", "Dorea", "Roseburia"), class = "factor"),
taxx = structure(c(4L, 2L, 2L, 1L, 1L, 3L), .Label = c("Alistipes", "Clostridium",
"Dorea", "Roseburia"), class = "factor")), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6"))
Upvotes: 2