Camila
Camila

Reputation: 89

How do I bind/combine selected columns from multiple dataframes into one

I have many data.frames (448) with the same columns names (9 in total), like this:

       V1              V2      V4    ... V9
ENSG00000000003.15   TSPAN6   7095
ENSG00000000005.6     TNMD    4355
       .                .       .
       .                .       .

i want to create another data.frame, maintaining the first 2 columns (V1 and V2), these are identical in every data.frame, and merge V4 column (these are different for every data.frame) from all data.frames, the rest of the columns should be excluded.

If it could be possible, I would like to rename the V4 columns to "sample1", "sample2" and so on, until 448.

So the final dataframe should be:

       V1              V2     V4_1    V4_2 ... V4_448
ENSG00000000003.15   TSPAN6   7095    3856       .
ENSG00000000005.6     TNMD    4355    2976       .
       .                .       .      .         . 
       .                .       .      .         .

I already did this:

reader <- function(f){
  read.table(f, sep='\t', skip=6, header=FALSE)
}

files <- list.files(path, 
                    recursive=TRUE, full.names=TRUE)

myfilelist <- lapply(files, reader)

But I don't know how to combine only these selected columns

This is the output of dput(lapply(myfilelist[1:2], head)):

myfilelist <- list(structure(list(V1 = c("ENSG00000000003.15", "ENSG00000000005.6", 
"ENSG00000000419.13", "ENSG00000000457.14", "ENSG00000000460.17", 
"ENSG00000000938.13"), V2 = c("TSPAN6", "TNMD", "DPM1", "SCYL3", 
"C1orf112", "FGR"), V3 = c("protein_coding", "protein_coding", 
"protein_coding", "protein_coding", "protein_coding", "protein_coding"
), V4 = c(7094L, 2L, 4355L, 1149L, 372L, 585L), V5 = c(3573L, 
1L, 2201L, 953L, 553L, 281L), V6 = c(3521L, 1L, 2154L, 883L, 
579L, 308L), V7 = c(59.9764, 0.052, 138.3704, 6.4018, 2.3896, 
6.6335), V8 = c(20.5827, 0.0178, 47.4859, 2.197, 0.8201, 2.2765
), V9 = c(22.2037, 0.0192, 51.2256, 2.37, 0.8847, 2.4558)), row.names = c(NA, 
6L), class = "data.frame"), structure(list(V1 = c("ENSG00000000003.15", 
"ENSG00000000005.6", "ENSG00000000419.13", "ENSG00000000457.14", 
"ENSG00000000460.17", "ENSG00000000938.13"), V2 = c("TSPAN6", 
"TNMD", "DPM1", "SCYL3", "C1orf112", "FGR"), V3 = c("protein_coding", 
"protein_coding", "protein_coding", "protein_coding", "protein_coding", 
"protein_coding"), V4 = c(2616L, 23L, 3746L, 1288L, 510L, 1578L
), V5 = c(1369L, 9L, 1876L, 1015L, 681L, 797L), V6 = c(1250L, 
14L, 1871L, 984L, 693L, 782L), V7 = c(16.8063, 0.4541, 90.4417, 
5.4531, 2.4895, 13.5969), V8 = c(4.8615, 0.1314, 26.1617, 1.5774, 
0.7201, 3.9331), V9 = c(6.0158, 0.1625, 32.3733, 1.9519, 0.8911, 
4.867)), row.names = c(NA, 6L), class = "data.frame"))

Upvotes: 1

Views: 39

Answers (2)

AkselA
AkselA

Reputation: 8846

We can do this quite easily in base R as well

bound <- data.frame(myfilelist[[1]][1:2], do.call(cbind, lapply(myfilelist, "[", 4)))
colnames(bound)[-(1:2)] <- paste0("sample", seq(length(myfilelist)))
bound
#                   V1       V2 sample1 sample2
# 1 ENSG00000000003.15   TSPAN6    7094    2616
# 2  ENSG00000000005.6     TNMD       2      23
# 3 ENSG00000000419.13     DPM1    4355    3746
# 4 ENSG00000000457.14    SCYL3    1149    1288
# 5 ENSG00000000460.17 C1orf112     372     510
# 6 ENSG00000000938.13      FGR     585    1578

Upvotes: 0

LMc
LMc

Reputation: 18722

It's not clear to me how you want to join this list of data frames (are the first two columns identical?), but here is an option using left-join:

library(dplyr)
library(purrr)

imap(myfilelist, \(df, i) select(df, 1:2, "sample{i}" := 4) ) |>
  reduce(left_join, by = join_by(V1, V2))

If the first two columns are identical across all data frames then you could bind them together:

library(dplyr)
library(purrr)

bind_cols(myfilelist[[1]][1:2],
          imap(myfilelist, \(df, i) select(df, "sample{i}" := 4)) |> bind_cols())

Upvotes: 1

Related Questions