Adding a new column based on random values in other column for multiple files simultaneously in R

Question

There are many posts about this issue, but I could not find a solution for my case. I have 140 individual files. Each file has the same columns, as follows.

   taxy         taxx
1  Alistipes    Roseburia
2  Alistipes    Clostridium
3  Alistipes    Clostridium
4  Dorea        Alistipes
5  Clostridium  Alistipes
6  Roseburia    Dorea

I need to create two new columns (otuno) based on taxy and taxx, respectively. For example, taxy contains more than 70 values, so it is impractical to assign a number to each value by hand. My desired output looks like this:

   taxy         taxx         taxyid  taxxid
1  Alistipes    Roseburia    1       2
2  Alistipes    Clostridium  1       3
3  Dorea        Alistipes    4       1
4  Clostridium  Alistipes    3       1
5  Roseburia    Dorea        2       4

How do I do this for all 140 files at once? All the files are in CSV format.

Ronak Shah · Accepted Answer

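We can unlist the data frame to gather all of its unique values, use those values as the factor levels for every column, and convert each factor to integer. Because both columns share the same levels, a given taxon gets the same id in taxy and taxx.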

unique_levels <- unique(unlist(df))

df[paste0(names(df), "_id")] <- lapply(df, function(x) 
                                as.integer(factor(x, levels = unique_levels)))
df

#         taxy        taxx taxy_id taxx_id
#1   Alistipes   Roseburia       1       4
#2   Alistipes Clostridium       1       3
#3   Alistipes Clostridium       1       3
#4       Dorea   Alistipes       2       1
#5 Clostridium   Alistipes       3       1
#6   Roseburia       Dorea       4       2
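
To see why this works: factor(x, levels = unique_levels) stores each value as its position in unique_levels, and as.integer() returns that position. Below is a minimal sketch with made-up vectors (the names lv, x, ids_factor and ids_match are illustrative only, not part of the answer above). It also shows that base R's match() gives the same positions directly.

```r
# Illustrative values only; 'lv' plays the role of unique_levels.
lv <- c("Alistipes", "Dorea", "Clostridium", "Roseburia")
x  <- c("Roseburia", "Clostridium", "Alistipes", "Dorea")

# Position of each value of x within lv, via a factor ...
ids_factor <- as.integer(factor(x, levels = lv))

# ... and via a direct lookup.
ids_match <- match(x, lv)

ids_factor
identical(ids_factor, ids_match)
```

Either form can be used inside the lapply call; values that are missing from the levels come back as NA in both cases.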

In dplyr, we can use mutate_all (superseded by across() since dplyr 1.0.0, but it still works):

library(dplyr)
df %>% mutate_all(list(id = ~as.integer(factor(., levels = unique_levels))))
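
With dplyr 1.0.0 or later, the same step could probably be written with across(). The following is a sketch on a small stand-in data frame (made up here, with character columns), not the answer's original code; the .names argument is used to produce the same taxy_id / taxx_id column names.

```r
library(dplyr)

# Small stand-in for the example data (assumed character columns).
df <- data.frame(
  taxy = c("Alistipes", "Dorea", "Clostridium"),
  taxx = c("Roseburia", "Alistipes", "Alistipes"),
  stringsAsFactors = FALSE
)
unique_levels <- unique(unlist(df))

# Apply the same factor-to-integer conversion to every column and
# store the results in new columns with an "_id" suffix.
res <- df %>%
  mutate(across(everything(),
                ~ as.integer(factor(.x, levels = unique_levels)),
                .names = "{.col}_id"))
res
```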

To apply this to multiple files, we can put the above code in a function

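# Only pick up the CSV files in the folder
all_files <- list.files("path/of/files", pattern = "\\.csv$", full.names = TRUE)
cols <- c("taxy", "taxx")

apply_fun <- function(df) {
  # Build the levels from the taxon columns only, so that any other
  # columns in a file do not affect the ids
  unique_levels <- unique(unlist(df[cols]))
  df[paste0(cols, "_id")] <- lapply(df[cols], function(x) as.integer(factor(x, levels = unique_levels)))
  return(df)
}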

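Then apply the function to each file with lapply, reading the files with read.csv (or whichever method you used) and writing the results back out. Note that basename() drops the directory, so the output files are written to the current working directory under the same file names.

lapply(seq_along(all_files), function(x) 
  write.csv(apply_fun(read.csv(all_files[x])), basename(all_files[x]), row.names = FALSE))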
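
Note that the function above numbers the taxa separately within each file, so the same taxon can receive different ids in different files. If the ids should be consistent across all files (an assumption; the question does not say), one possible approach is to collect the levels from every file first. The sketch below is self-contained: it creates two tiny CSV files in a temporary folder as stand-ins for the real ones, and the names in_dir, out_dir and global_levels are hypothetical.

```r
# Two tiny CSV files in a temporary folder stand in for the 140 real files.
in_dir  <- file.path(tempdir(), "tax_in")
out_dir <- file.path(tempdir(), "tax_out")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
write.csv(data.frame(taxy = c("Alistipes", "Dorea"),
                     taxx = c("Roseburia", "Alistipes")),
          file.path(in_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(taxy = c("Roseburia", "Clostridium"),
                     taxx = c("Dorea", "Alistipes")),
          file.path(in_dir, "b.csv"), row.names = FALSE)

cols  <- c("taxy", "taxx")
files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)

# Read everything once and collect the taxa seen in any file.
dfs <- lapply(files, read.csv, stringsAsFactors = FALSE)
global_levels <- unique(unlist(lapply(dfs, function(d) {
  unlist(d[cols], use.names = FALSE)
})))

# Add the id columns using the shared lookup table.
out <- lapply(dfs, function(d) {
  d[paste0(cols, "_id")] <- lapply(d[cols], match, table = global_levels)
  d
})

# Write each result to a separate output folder under the same file name.
for (i in seq_along(files)) {
  write.csv(out[[i]], file.path(out_dir, basename(files[i])), row.names = FALSE)
}
```

Writing to a separate folder keeps the original files untouched. For the real data, in_dir would be replaced by the folder that holds the 140 files and the two write.csv calls that create the demo files would be dropped.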

data

df <- structure(list(taxy = structure(c(1L, 1L, 1L, 3L, 2L, 4L), 
.Label = c("Alistipes", "Clostridium", "Dorea", "Roseburia"), class = "factor"), 
taxx = structure(c(4L, 2L, 2L, 1L, 1L, 3L), .Label = c("Alistipes", "Clostridium", 
"Dorea", "Roseburia"), class = "factor")), class = "data.frame", 
row.names = c("1", "2", "3", "4", "5", "6"))
