Taylor Maurer
Taylor Maurer

Reputation: 239

Collapse columns in a dataframe (R)

Basically, I have a dataframe, df

                  Beginning1 Protein2    Protein3    Protein4    Biomarker1
      Pathway3    A         G           NA           NA           F
      Pathway8    Z         G           NA           NA           E
      Pathway9    A         G           Z            H            F
      Pathway6    Y         G           Z            H            E
      Pathway2    A         G           D            NA           F
      Pathway5    Q         G           D            NA           E
      Pathway1    A         D           K            NA           F
      Pathway7    A         B           C            D            F
      Pathway4    V         B           C            D            E

And I want to combine the dataframe so that those rows when are identical from "Protein2" to "Protein4" are condense, giving the following:

            Beginning1 Protein2     Protein3     Protein4     Biomarker1
Pathway3    A,Z         G           NA           NA           F,E
Pathway9    A,Y         G           Z            H            F,E
Pathway2    A,Q         G           D            NA           F,E
Pathway1    A           D           K            NA           F
Pathway7    A,V         B           C            D            F,E

This is very similar to a question that I asked before (Consolidating duplicate rows in a dataframe), however the difference is that I am also consolidating the "Beginning1" row.

So far, I have tried:

library(dat.table)
dat<-data.table(df)

Total_collapse <- dat[, .(
Biomarker1 = paste0(Biomarker1, collapse = ", ")),
by = .(Beginning1, Protein1, Protein2, Protein3)]

Total_collapse <- dat[, .(
Beginning1 = paste0(Beginning1, collapse = ", ")),
by = .(Protein1, Protein2, Protein3)]

which gives the output:

            Beginning1  Protein2    Protein3      Protein4      Biomarker1
Pathway3    G           NA           NA           F,E
Pathway9    G           Z            H            F,E
Pathway2    G           D            NA           F,E
Pathway1    D           K            NA           F
Pathway7    B           C            D            F,E

Does anyone know how to fix this problem? I have also tried duplicating the solution from Collapse / concatenate / aggregate a column to a single comma separated string within each group, but have had no success.

I am sorry if it is a simple error- I am pretty new to R.

Upvotes: 6

Views: 12589

Answers (3)

akrun
akrun

Reputation: 886938

We can use aggregate from base R

r1 <- aggregate(cbind(Beginning1, Biomarker1)~., replace(df,is.na(df), "NA"), FUN = toString)
r1
#    Protein2 Protein3 Protein4 Beginning1 Biomarker1
#1        B        C        D       A, V       F, E
#2        G        Z        H       A, Y       F, E
#3        G        D       NA       A, Q       F, E
#4        D        K       NA          A          F
#5        G       NA       NA       A, Z       F, E
r1[r1=="NA"] <- NA

Upvotes: 1

Adam Spannbauer
Adam Spannbauer

Reputation: 2747

Using data.table you can use .SD to refer to all columns not specified in the by argument. Then we can use lapply to accomplish the paste() with collapse.

library(data.table)
dt <- read.table(text = "Beginning1 Protein2    Protein3    Biomarker1
                  A         G           NA           NA           F
                  Z         G           NA           NA           E
                  A         G           Z            H            F
                  Y         G           Z            H            E
                  A         G           D            NA           F
                  Q         G           D            NA           E
                  A         D           K            NA           F
                  A         B           C            D            F
                  V         B           C            D            E",header = T)
dt <- data.table(dt)
dt[,lapply(.SD, function(col) paste(col, collapse=", ")), 
    by=.(Protein2, Protein3, Protein4)]

Output

   Protein2 Protein3 Protein4 Beginning1 Biomarker1
1:        G       NA       NA       A, Z       F, E
2:        G        Z        H       A, Y       F, E
3:        G        D       NA       A, Q       F, E
4:        D        K       NA          A          F
5:        B        C        D       A, V       F, E

Upvotes: 3

MrFlick
MrFlick

Reputation: 206167

Here's a possible solution using dplyr

df %>% group_by_at(vars(Protein2:Protein4)) %>%
  summarize_all(paste, collapse=",")

Upvotes: 7

Related Questions