Seymoo

Reputation: 189

reshape a data.frame based on similar value in one column

I have a data.frame with 2 columns, where values in the second column are repeated. For example:

     HUGO                     Cell
1    CD28                 T cells
2    CD3D                 T cells
3    CD3G                 T cells
4    CD8A                lymphocytes
5    EOMES               lymphocytes
6    FGFBP2              lymphocytes
7    GNLY                lymphocytes
8    NCR1                 NK cells
9    PTGDR                NK cells
10   SH2D1B               NK cells

I want all values in column HUGO that correspond to the same name in column Cell to be collected into a named list, listed after each unique name.

For example:

T cells: CD28     CD3D     CD3G
lymphocytes: CD8A    EOMES    FGFBP2    GNLY
... 

I have tried
reshape(data.frame, timevar = "HUGO",idvar = "Cell",direction = "wide") but it just returns the number of values for each name in the Cell column.

Upvotes: 0

Views: 544

Answers (1)

G. Grothendieck

Reputation: 269586

Here are some possibilities depending on what it is you want. The first 5 use no packages.

1) aggregate/c This gives a data frame whose second column is a list, each element of which is a character vector of the HUGO names for that Cell.

aggregate(HUGO ~ Cell, DF, c)

giving:

         Cell                      HUGO
1 lymphocytes CD8A, EOMES, FGFBP2, GNLY
2    NK cells       NCR1, PTGDR, SH2D1B
3     T cells          CD28, CD3D, CD3G

2) aggregate/toString This gives a data frame whose second column contains character strings in which the HUGO names are separated by commas.

aggregate(HUGO ~ Cell, DF, toString)

giving:

         Cell                      HUGO
1 lymphocytes CD8A, EOMES, FGFBP2, GNLY
2    NK cells       NCR1, PTGDR, SH2D1B
3     T cells          CD28, CD3D, CD3G

3) unstack This gives a list, one component per Cell, whose components are each the HUGO names of that Cell.

unstack(DF)

giving:

$lymphocytes
[1] "CD8A"   "EOMES"  "FGFBP2" "GNLY"  

$`NK cells`
[1] "NCR1"   "PTGDR"  "SH2D1B"

$`T cells`
[1] "CD28" "CD3D" "CD3G"
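
Since the question asks for a named list, a closely related base R call may also be worth knowing. The following is a minimal sketch, not part of the original answer, that uses split to build the same kind of list; it rebuilds the sample data inline (assumed to match the DF given in the Note at the end) so that it runs on its own. The name gene_list is just illustrative.

```r
# Sample data, assumed identical to DF from the Note at the end of the answer.
DF <- data.frame(
  HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
           "FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"),
  Cell = c(rep("T cells", 3), rep("lymphocytes", 4), rep("NK cells", 3)),
  stringsAsFactors = FALSE
)

# split() groups the first vector by the second and returns a named list,
# one character vector of HUGO names per distinct Cell value.
gene_list <- split(DF$HUGO, DF$Cell)

# Components can be looked up by Cell name.
gene_list[["T cells"]]
```

Unlike unstack, split always returns a list, even when every group happens to have the same number of elements.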

4) tapply This gives a matrix whose rows are Cells and whose columns are the ordinal position of each HUGO name within its Cell.

DF2 <- transform(DF, seq = ave(seq_along(HUGO), Cell, FUN = seq_along))
tapply(DF2$HUGO, DF2[-1], c)

giving:

             seq
Cell          1      2       3        4     
  lymphocytes "CD8A" "EOMES" "FGFBP2" "GNLY"
  NK cells    "NCR1" "PTGDR" "SH2D1B" NA    
  T cells     "CD28" "CD3D"  "CD3G"   NA   

5) reshape This uses DF2 from the last alternative together with reshape to give a data frame:

reshape(DF2, timevar = "seq", idvar = "Cell", dir = "wide")

giving:

         Cell HUGO.1 HUGO.2 HUGO.3 HUGO.4
1     T cells   CD28   CD3D   CD3G   <NA>
4 lymphocytes   CD8A  EOMES FGFBP2   GNLY
8    NK cells   NCR1  PTGDR SH2D1B   <NA>

6) spread This gives a "tbl_df" class object as output (which is a subclass of "data.frame").

library(dplyr)
library(tidyr)

DF %>% 
   group_by(Cell) %>%
   mutate(seq = 1:n()) %>%
   ungroup() %>%
   spread(seq, HUGO)

giving:

         Cell    1     2      3    4
1 lymphocytes CD8A EOMES FGFBP2 GNLY
2    NK cells NCR1 PTGDR SH2D1B <NA>
3     T cells CD28  CD3D   CD3G <NA>
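
In tidyr 1.0.0 and later, spread is superseded by pivot_wider. The sketch below is an addition to the original answer and assumes a reasonably recent dplyr and tidyr are installed; it rebuilds the sample data inline (assumed to match DF from the Note at the end) and should produce the same wide layout. The name wide is illustrative only.

```r
library(dplyr)
library(tidyr)

# Sample data, assumed identical to DF from the Note at the end of the answer.
DF <- data.frame(
  HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
           "FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"),
  Cell = c(rep("T cells", 3), rep("lymphocytes", 4), rep("NK cells", 3)),
  stringsAsFactors = FALSE
)

wide <- DF %>%
  group_by(Cell) %>%
  mutate(seq = row_number()) %>%   # position of each HUGO name within its Cell
  ungroup() %>%
  pivot_wider(names_from = seq, values_from = HUGO)  # one column per position

wide
```

Cells with fewer HUGO names than the longest group are padded with NA, as with spread.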

7) read.zoo read.zoo gives a zoo object whose times are the Cells.
Since the times are actually character strings we use FUN=identity to avoid interpretation. fortify.zoo converts that to a data frame. DF2 is from above.

library(zoo)

fortify.zoo(read.zoo(DF2, split = "seq", index = "Cell", FUN = identity))

giving:

       Index    1     2      3    4
1 lymphocytes CD8A EOMES FGFBP2 GNLY
2    NK cells NCR1 PTGDR SH2D1B <NA>
3     T cells CD28  CD3D   CD3G <NA>

8) dcast This gives a data.table as output.

library(data.table)

DT <- data.table(DF)
DT[, seq:=1:.N, by = Cell]
dcast(DT, Cell ~ seq, value.var = "HUGO")

giving:

          Cell    1     2      3    4
1:    NK cells NCR1 PTGDR SH2D1B   NA
2:     T cells CD28  CD3D   CD3G   NA
3: lymphocytes CD8A EOMES FGFBP2 GNLY

Note: The input DF used above, in reproducible form, is:

DF <- structure(list(HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES", 
"FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"), Cell = c("T cells", 
"T cells", "T cells", "lymphocytes", "lymphocytes", "lymphocytes", 
"lymphocytes", "NK cells", "NK cells", "NK cells")), .Names = c("HUGO", 
"Cell"), class = "data.frame", row.names = c(NA, -10L))

Upvotes: 3
