Reputation: 585
I have a large data frame that I'm working with; the first few lines are as follows:
Assay Genotype Sample Result
1 001 G 1 0
2 001 A 2 1
3 001 G 3 0
4 001 NA 4 NA
5 002 T 1 0
6 002 G 2 1
7 002 T 3 0
8 002 T 4 0
9 003 NA 1 N
10 003 G 2 1
11 003 G 3 1
12 003 T 4 0
In total I'll be working with 2000 samples and 168 assays for each sample. For each sample, I'd like to extract the data in 'Result' to create either a list or data frame that looks something like this:
Sample Data
1 00N
2 111
3 001
4 N00
The resulting data frame (or similar preferred data structure) would thus have 2000 rows and 2 columns. Each 'Data' entry would contain 168 characters, one for each 'Assay'.
Can somebody help me with this problem?
Upvotes: 2
Views: 598
Reputation: 69251
One approach with package plyr and base function paste:
library(plyr)
ddply(dat, "Sample", summarize, Data = paste(Result, collapse = ""))
Sample Data
1 1 00N
2 2 111
3 3 001
4 4 NA00
EDIT to address question
Probably the easiest way I can think of to change your NA to N is to use gsub on the result of ddply. Note I'm liberally borrowing the very good point provided by @Brian re: ordering. Do that, it's a good tip!
out <- ddply(dat, "Sample", summarize, Data = paste(Result[order(Assay)], collapse = ""))
Then use gsub:
out$Data <- gsub("NA", "N", out$Data)
et voila:
Sample Data
1 1 00N
2 2 111
3 3 001
4 4 N00
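One caveat with the gsub step: it runs on the pasted string, so it would also rewrite any "N" that happens to sit directly before an "A". That can't occur with the 0/1/N codes shown here, but if it is a concern, the missing values can be replaced before pasting instead. A minimal base R sketch, assuming the columns are stored as character; dat is rebuilt from the question's sample and collapse_results is a made-up helper name:

```r
# Sample data rebuilt from the question (column types are an assumption).
dat <- data.frame(
  Assay    = rep(c("001", "002", "003"), each = 4),
  Genotype = c("G", "A", "G", NA, "T", "G", "T", "T", NA, "G", "G", "T"),
  Sample   = rep(1:4, times = 3),
  Result   = c("0", "1", "0", NA, "0", "1", "0", "0", "N", "1", "1", "0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: order by assay, map missing values to "N", then paste.
collapse_results <- function(result, assay) {
  result <- as.character(result[order(assay)])
  result[is.na(result)] <- "N"
  paste(result, collapse = "")
}

# Apply the helper to each sample's rows.
pieces <- split(dat, dat$Sample)
out <- data.frame(
  Sample = names(pieces),  # sample IDs come back as character here
  Data   = unname(vapply(pieces,
                         function(d) collapse_results(d$Result, d$Assay),
                         character(1))),
  stringsAsFactors = FALSE
)
out$Data
# [1] "00N" "111" "001" "N00"
```

Because the replacement happens element by element, it never touches anything except a genuinely missing value.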
Upvotes: 3
Reputation: 58875
Note that @Chase and @Andrie both assume that the data is already sorted by assay (which your example is, so not an unreasonable assumption). If it is not, you can still get the string in the proper order.
Adapting @Chase's solution
library(plyr)
ddply(dat, "Sample", summarize,
      Data = paste(Result[order(Assay)], collapse = ""))
gives
Sample Data
1 1 00N
2 2 111
3 3 001
4 4 NA00
If we use data which is not sorted:
dat.scramble <- dat[sample(nrow(dat)),]
> dat.scramble
Assay Genotype Sample Result
6 002 G 2 1
1 001 G 1 0
3 001 G 3 0
7 002 T 3 0
10 003 G 2 1
8 002 T 4 0
12 003 T 4 0
5 002 T 1 0
2 001 A 2 1
4 001 NA 4 NA
9 003 NA 1 N
11 003 G 3 1
we still get the same result
ddply(dat.scramble, "Sample", summarize,
      Data = paste(Result[order(Assay)], collapse = ""))
Sample Data
1 1 00N
2 2 111
3 3 001
4 4 NA00
Upvotes: 1
Reputation: 179578
Base R solution using split and sapply:
sapply(split(dat$Result, dat$Sample), paste, collapse="")
1 2 3 4
"00N" "111" "001" "NA00"
Upvotes: 3
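The split/sapply call returns a named character vector rather than the two-column layout asked for in the question. If a data frame is preferred, it can be assembled from the names and values. A small sketch, with dat rebuilt from the question's sample (the column types are my assumption); note that the missing value still shows up as "NA" here, as in the output above:

```r
# Sample data rebuilt from the question (column types are an assumption).
dat <- data.frame(
  Assay    = rep(c("001", "002", "003"), each = 4),
  Genotype = c("G", "A", "G", NA, "T", "G", "T", "T", NA, "G", "G", "T"),
  Sample   = rep(1:4, times = 3),
  Result   = c("0", "1", "0", NA, "0", "1", "0", "0", "N", "1", "1", "0"),
  stringsAsFactors = FALSE
)

# One pasted string per sample; the vector's names are the sample IDs.
res <- sapply(split(dat$Result, dat$Sample), paste, collapse = "")

# Turn the named vector into a two-column data frame.
out <- data.frame(
  Sample = names(res),
  Data   = unname(res),
  stringsAsFactors = FALSE
)
out$Data
# [1] "00N"  "111"  "001"  "NA00"
```

This relies on the rows already being in assay order within each sample, as the one-liner above does.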