Reputation: 239

Collapse columns in a dataframe (R)

Basically, I have a dataframe, df

                  Beginning1 Protein2    Protein3    Protein4    Biomarker1
      Pathway3    A         G           NA           NA           F
      Pathway8    Z         G           NA           NA           E
      Pathway9    A         G           Z            H            F
      Pathway6    Y         G           Z            H            E
      Pathway2    A         G           D            NA           F
      Pathway5    Q         G           D            NA           E
      Pathway1    A         D           K            NA           F
      Pathway7    A         B           C            D            F
      Pathway4    V         B           C            D            E

And I want to combine the dataframe so that those rows when are identical from "Protein2" to "Protein4" are condense, giving the following:

            Beginning1 Protein2     Protein3     Protein4     Biomarker1
Pathway3    A,Z         G           NA           NA           F,E
Pathway9    A,Y         G           Z            H            F,E
Pathway2    A,Q         G           D            NA           F,E
Pathway1    A           D           K            NA           F
Pathway7    A,V         B           C            D            F,E

This is very similar to a question that I asked before (Consolidating duplicate rows in a dataframe), however the difference is that I am also consolidating the "Beginning1" row.

So far, I have tried:

library(dat.table)
dat<-data.table(df)

Total_collapse <- dat[, .(
Biomarker1 = paste0(Biomarker1, collapse = ", ")),
by = .(Beginning1, Protein1, Protein2, Protein3)]

Total_collapse <- dat[, .(
Beginning1 = paste0(Beginning1, collapse = ", ")),
by = .(Protein1, Protein2, Protein3)]

which gives the output:

            Beginning1  Protein2    Protein3      Protein4      Biomarker1
Pathway3    G           NA           NA           F,E
Pathway9    G           Z            H            F,E
Pathway2    G           D            NA           F,E
Pathway1    D           K            NA           F
Pathway7    B           C            D            F,E

Does anyone know how to fix this problem? I have also tried duplicating the solution from Collapse / concatenate / aggregate a column to a single comma separated string within each group, but have had no success.

I am sorry if it is a simple error- I am pretty new to R.

Upvotes: 6

Answers (3)

akrun

Reputation: 887901

We can use aggregate from base R

r1 <- aggregate(cbind(Beginning1, Biomarker1)~., replace(df,is.na(df), "NA"), FUN = toString)
r1
#    Protein2 Protein3 Protein4 Beginning1 Biomarker1
#1        B        C        D       A, V       F, E
#2        G        Z        H       A, Y       F, E
#3        G        D       NA       A, Q       F, E
#4        D        K       NA          A          F
#5        G       NA       NA       A, Z       F, E
r1[r1=="NA"] <- NA

Upvotes: 1

Adam Spannbauer

Reputation: 2757

Using data.table you can use .SD to refer to all columns not specified in the by argument. Then we can use lapply to accomplish the paste() with collapse.

library(data.table)
dt <- read.table(text = "Beginning1 Protein2    Protein3    Biomarker1
                  A         G           NA           NA           F
                  Z         G           NA           NA           E
                  A         G           Z            H            F
                  Y         G           Z            H            E
                  A         G           D            NA           F
                  Q         G           D            NA           E
                  A         D           K            NA           F
                  A         B           C            D            F
                  V         B           C            D            E",header = T)
dt <- data.table(dt)
dt[,lapply(.SD, function(col) paste(col, collapse=", ")), 
    by=.(Protein2, Protein3, Protein4)]

Output

   Protein2 Protein3 Protein4 Beginning1 Biomarker1
1:        G       NA       NA       A, Z       F, E
2:        G        Z        H       A, Y       F, E
3:        G        D       NA       A, Q       F, E
4:        D        K       NA          A          F
5:        B        C        D       A, V       F, E

Upvotes: 3

MrFlick

Reputation: 206566

Here's a possible solution using dplyr

df %>% group_by_at(vars(Protein2:Protein4)) %>%
  summarize_all(paste, collapse=",")

Upvotes: 7

Collapse columns in a dataframe (R)

Answers (3)

Related Questions