Reputation: 189
I have a data.frame with 2 columns, where values in the second column are repeated. For example:
HUGO Cell
1 CD28 T cells
2 CD3D T cells
3 CD3G T cells
4 CD8A lymphocytes
5 EOMES lymphocytes
6 FGFBP2 lymphocytes
7 GNLY lymphocytes
8 NCR1 NK cells
9 PTGDR NK cells
10 SH2D1B NK cells
I want all values in column HUGO that correspond to each unique name in column Cell to be collected into a named list, listed after that unique name.
For example:
T cells: CD28 CD3D CD3G
lymphocytes: CD8A EOMES FGFBP2 GNLY
...
I have tried
reshape(data.frame, timevar = "HUGO",idvar = "Cell",direction = "wide")
but it just returns number of values for each name in Cell column.
Upvotes: 0
Views: 544
Reputation: 269586
Here are some possibilities depending on what it is you want. The first 5 use no packages.
1) aggregate/c This gives a data frame whose second column is a list of character vectors of HUGO names, one vector per Cell.
aggregate(HUGO ~ Cell, DF, c)
giving:
Cell HUGO
1 lymphocytes CD8A, EOMES, FGFBP2, GNLY
2 NK cells NCR1, PTGDR, SH2D1B
3 T cells CD28, CD3D, CD3G
2) aggregate/toString This gives a data frame whose second column contains one character string per Cell, with the HUGO names separated by commas.
aggregate(HUGO ~ Cell, DF, toString)
giving:
Cell HUGO
1 lymphocytes CD8A, EOMES, FGFBP2, GNLY
2 NK cells NCR1, PTGDR, SH2D1B
3 T cells CD28, CD3D, CD3G
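If the exact layout from the question is wanted (names separated by spaces, one "Cell: names" line per group), the toString call can be swapped for paste with a collapse argument. This is only a sketch of that variation; the DF definition is copied from the Note at the end, and the name res is mine.

```r
# Input data, same values as in the Note at the end of this answer.
DF <- data.frame(
  HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
           "FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"),
  Cell = c("T cells", "T cells", "T cells", "lymphocytes", "lymphocytes",
           "lymphocytes", "lymphocytes", "NK cells", "NK cells", "NK cells"),
  stringsAsFactors = FALSE
)

# Same idea as toString, but joining with a single space instead of ", ".
res <- aggregate(HUGO ~ Cell, DF, function(x) paste(x, collapse = " "))

# Print one "Cell: names" line per group.
cat(paste0(res$Cell, ": ", res$HUGO), sep = "\n")
```

The order of the printed lines follows the sorted Cell values, which can depend on the locale.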
3) unstack This gives a list, one component per Cell, whose components are each the HUGO names of that Cell.
unstack(DF)
giving:
$lymphocytes
[1] "CD8A" "EOMES" "FGFBP2" "GNLY"
$`NK cells`
[1] "NCR1" "PTGDR" "SH2D1B"
$`T cells`
[1] "CD28" "CD3D" "CD3G"
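For comparison, base R's split should give the same kind of named list when the two columns are passed explicitly, which may be handy if the data frame has more than two columns. A minimal sketch, with DF copied from the Note at the end and cells being just an illustrative name:

```r
# Input data, same values as in the Note at the end of this answer.
DF <- data.frame(
  HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
           "FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"),
  Cell = c("T cells", "T cells", "T cells", "lymphocytes", "lymphocytes",
           "lymphocytes", "lymphocytes", "NK cells", "NK cells", "NK cells"),
  stringsAsFactors = FALSE
)

# Split the HUGO vector into one component per distinct Cell value.
cells <- split(DF$HUGO, DF$Cell)

# Components are accessed by Cell name.
cells[["T cells"]]
```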
4) tapply This gives a matrix whose rows are Cells and whose columns are the position of each HUGO name within its Cell.
DF2 <- transform(DF, seq = ave(seq_along(HUGO), Cell, FUN = seq_along))
tapply(DF2$HUGO, DF2[-1], c)
giving:
seq
Cell 1 2 3 4
lymphocytes "CD8A" "EOMES" "FGFBP2" "GNLY"
NK cells "NCR1" "PTGDR" "SH2D1B" NA
T cells "CD28" "CD3D" "CD3G" NA
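To see what the seq column holds, the ave step can be run on its own. This sketch just exposes that intermediate result; DF is copied from the Note at the end, and seq_in_cell is a name I made up for illustration.

```r
# Input data, same values as in the Note at the end of this answer.
DF <- data.frame(
  HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
           "FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"),
  Cell = c("T cells", "T cells", "T cells", "lymphocytes", "lymphocytes",
           "lymphocytes", "lymphocytes", "NK cells", "NK cells", "NK cells"),
  stringsAsFactors = FALSE
)

# ave applies seq_along separately within each Cell, so every row gets
# its running position inside its own group.
seq_in_cell <- ave(seq_along(DF$HUGO), DF$Cell, FUN = seq_along)

# Show the counter next to the original columns.
data.frame(DF, seq = seq_in_cell)
```

The counter restarts at 1 whenever a new Cell begins, which is what lets tapply and reshape place each HUGO name in its own column.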
5) reshape This uses DF2 from the last alternative together with reshape to give a data frame:
reshape(DF2, timevar = "seq", idvar = "Cell", dir = "wide")
giving:
Cell HUGO.1 HUGO.2 HUGO.3 HUGO.4
1 T cells CD28 CD3D CD3G <NA>
4 lymphocytes CD8A EOMES FGFBP2 GNLY
8 NK cells NCR1 PTGDR SH2D1B <NA>
6) spread This gives a "tbl_df" class object as output (which is a subclass of "data.frame"):
library(dplyr)
library(tidyr)
DF %>%
group_by(Cell) %>%
mutate(seq = 1:n()) %>%
ungroup() %>%
spread(seq, HUGO)
giving:
Cell 1 2 3 4
1 lymphocytes CD8A EOMES FGFBP2 GNLY
2 NK cells NCR1 PTGDR SH2D1B <NA>
3 T cells CD28 CD3D CD3G <NA>
7) read.zoo read.zoo gives a zoo object whose times are the Cells. Since the times are actually character strings we use FUN = identity to avoid interpretation. fortify.zoo converts that to a data frame. DF2 is from above.
library(zoo)
fortify.zoo(read.zoo(DF2, split = "seq", index = "Cell", FUN = identity))
giving:
Index 1 2 3 4
1 lymphocytes CD8A EOMES FGFBP2 GNLY
2 NK cells NCR1 PTGDR SH2D1B <NA>
3 T cells CD28 CD3D CD3G <NA>
8) dcast This gives a data.table as output.
library(data.table)
DT <- data.table(DF)
DT[, seq:=1:.N, by = Cell]
dcast(DT, Cell ~ seq, value.var = "HUGO")
giving:
Cell 1 2 3 4
1: NK cells NCR1 PTGDR SH2D1B NA
2: T cells CD28 CD3D CD3G NA
3: lymphocytes CD8A EOMES FGFBP2 GNLY
Note: The input DF used above, in reproducible form, is:
DF <- structure(list(HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
"FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"), Cell = c("T cells",
"T cells", "T cells", "lymphocytes", "lymphocytes", "lymphocytes",
"lymphocytes", "NK cells", "NK cells", "NK cells")), .Names = c("HUGO",
"Cell"), class = "data.frame", row.names = c(NA, -10L))
Upvotes: 3