Sam Globus

Reputation: 585

Extracting data (or reshaping) a data frame from an existing data frame in R

I have a large data frame that I'm working with. The first few lines are as follows:

      Assay   Genotype   Sample    Result
1     001        G         1         0
2     001        A         2         1
3     001        G         3         0 
4     001        NA        4         NA
5     002        T         1         0
6     002        G         2         1
7     002        T         3         0 
8     002        T         4         0
9     003        NA        1         N
10    003        G         2         1
11    003        G         3         1 
12    003        T         4         0

In total I'll be working with 2000 samples and 168 assays for each sample. For each sample, I'd like to extract the data in 'Result' to create either a list or a data frame that looks something like this:

Sample  Data
   1    00N
   2    111
   3    001
   4    N00

The resulting data frame (or similar preferred data structure) would thus have 2000 rows and 2 columns. Each 'Data' entry would contain 168 characters, one for each 'Assay'.

Can somebody help me with this problem?

Upvotes: 2

Views: 598

Answers (3)

Chase

Reputation: 69251

One approach with package plyr and base function paste:

library(plyr)
ddply(dat, "Sample", summarize, Data = paste(Result, collapse = ""))

  Sample Data
1      1  00N
2      2  111
3      3  001
4      4 NA00

EDIT to address question

Probably the easiest way I can think of to change your NA to N is to use gsub on the result of ddply. Note that I'm liberally borrowing the very good point made by @Brian about ordering by Assay. Do that, it's a good tip!

out <- ddply(dat, "Sample", summarize, Data = paste(Result[order(Assay)], collapse = ""))

Then use gsub

out$Data <- gsub("NA", "N", out$Data)

Et voilà:

  Sample Data
1      1  00N
2      2  111
3      3  001
4      4  N00
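
As an alternative to gsub, here is a minimal base R sketch that recodes the missing values as "N" before pasting, so no pattern matching on the concatenated string is needed. The data frame is a hypothetical reconstruction of the question's example (Genotype omitted), and it assumes Result is stored as character; `res`, `ord` and `strings` are illustrative names only.

```r
# Hypothetical reconstruction of the example data from the question
# (assumes Result is character, since it holds "0", "1", "N" and NA).
dat <- data.frame(
  Assay  = rep(c("001", "002", "003"), each = 4),
  Sample = rep(1:4, times = 3),
  Result = c("0", "1", "0", NA, "0", "1", "0", "0", "N", "1", "1", "0"),
  stringsAsFactors = FALSE
)

# Recode missing results as "N" before concatenating.
res <- as.character(dat$Result)
res[is.na(res)] <- "N"

# Order by Sample, then Assay, so each string follows assay order.
ord <- order(dat$Sample, dat$Assay)
strings <- tapply(res[ord], dat$Sample[ord], paste, collapse = "")

# Assemble the two-column result.
out <- data.frame(Sample = as.integer(names(strings)),
                  Data = as.vector(strings),
                  stringsAsFactors = FALSE)
out
```

Recoding first means every result contributes exactly one character, which matters if the strings are later indexed by assay position.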

Upvotes: 3

Brian Diggs

Reputation: 58875

Note that @Chase and @Andrie both assume that the data is already sorted by assay (which your example is, so not an unreasonable assumption). If it is not, you can still get the string in the proper order.

Adapting @Chase's solution

library(plyr)
ddply(dat, "Sample", summarize, 
  Data = paste(Result[order(Assay)], collapse = ""))

gives

  Sample Data
1      1  00N
2      2  111
3      3  001
4      4 NA00

If we use data which is not sorted:

dat.scramble <- dat[sample(nrow(dat)),]

> dat.scramble
   Assay Genotype Sample Result
6    002        G      2      1
1    001        G      1      0
3    001        G      3      0
7    002        T      3      0
10   003        G      2      1
8    002        T      4      0
12   003        T      4      0
5    002        T      1      0
2    001        A      2      1
4    001       NA      4     NA
9    003       NA      1      N
11   003        G      3      1

we still get the same result

ddply(dat.scramble, "Sample", summarize, 
  Data = paste(Result[order(Assay)], collapse = ""))

  Sample Data
1      1  00N
2      2  111
3      3  001
4      4 NA00
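
The same check can be sketched without plyr. This is only an illustration under stated assumptions: the data frame is a hypothetical reconstruction of the question's example (Genotype omitted, Result as character), `collapse_by_sample` is a made-up helper name, and `set.seed` is there only to make the shuffle repeatable.

```r
# Hypothetical reconstruction of the example data from the question.
dat <- data.frame(
  Assay  = rep(c("001", "002", "003"), each = 4),
  Sample = rep(1:4, times = 3),
  Result = c("0", "1", "0", NA, "0", "1", "0", "0", "N", "1", "1", "0"),
  stringsAsFactors = FALSE
)

# Illustrative helper: sort by Sample and Assay, then paste the
# results of each sample into one string.
collapse_by_sample <- function(d) {
  d <- d[order(d$Sample, d$Assay), ]
  sapply(split(d$Result, d$Sample), paste, collapse = "")
}

# Shuffle the rows and compare against the sorted input.
set.seed(42)
dat.scramble <- dat[sample(nrow(dat)), ]
same <- identical(collapse_by_sample(dat), collapse_by_sample(dat.scramble))
same
```

Because the helper sorts before pasting, the row order of the input should not affect the resulting strings.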

Upvotes: 1

Andrie

Reputation: 179578

Base R solution using split and sapply:

sapply(split(dat$Result, dat$Sample), paste, collapse="")

     1      2      3      4 
 "00N"  "111"  "001" "NA00" 
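
If a two-column data frame is wanted rather than a named vector, the result can be wrapped as sketched below. The data frame is a hypothetical reconstruction of the question's example (Genotype omitted, Result as character), and `n_assays` and `bad` are illustrative names. The last lines flag any string whose length differs from the number of assays, which is how a pasted "NA" shows up.

```r
# Hypothetical reconstruction of the example data from the question.
dat <- data.frame(
  Assay  = rep(c("001", "002", "003"), each = 4),
  Sample = rep(1:4, times = 3),
  Result = c("0", "1", "0", NA, "0", "1", "0", "0", "N", "1", "1", "0"),
  stringsAsFactors = FALSE
)

# Named character vector, one element per sample.
res <- sapply(split(dat$Result, dat$Sample), paste, collapse = "")

# Turn the names into a Sample column and the values into a Data column.
out <- data.frame(Sample = as.integer(names(res)),
                  Data = unname(res),
                  stringsAsFactors = FALSE)

# Each string should hold one character per assay; a different length
# points to an NA that was pasted as the two characters "NA".
n_assays <- length(unique(dat$Assay))
bad <- out$Data[nchar(out$Data) != n_assays]
bad
```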

Upvotes: 3
