Furqan

Reputation: 69

Sub-setting or arrange the data in R

As I am new to R, this question may seem like a piece of cake to you. I have data in a txt file. The first column holds the cluster number and the second column holds the names of different organisms. For example:

  1. 0 org4|gene759
  2. 1 org1|gene992
  3. 2 org1|gene1101
  4. 3 org4|gene757
  5. 4 org1|gene1702
  6. 5 org1|gene989
  7. 6 org1|gene990
  8. 7 org1|gene1699
  9. 9 org1|gene1102
  10. 10 org4|gene2439
  11. 10 org1|gene1374

I need to rearrange/reshape the data into the following format:

Cluster No. org1 org2 org3 org4


  1. 0 0 0 1
  2. 1 0 0 0

I could not figure out how to do it in R. Thanks

Upvotes: 2

Views: 91

Answers (2)

Martin Smith

Reputation: 4077

Reading the table into R can be done with

input <- read.table('filename.txt')
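To try this without a file on disk, here is a minimal sketch that feeds a few of the sample rows from the question to read.table through its text argument. The variable name sample_rows is my own; the default column names V1 and V2 are kept so the rest of the answer applies unchanged.

```r
# A few rows copied from the question; with a real file, use
# read.table('filename.txt') instead
sample_rows <- c("0 org4|gene759", "1 org1|gene992", "2 org1|gene1101",
                 "10 org4|gene2439", "10 org1|gene1374")

# Whitespace is the default separator, so the '|' stays inside column 2
input <- read.table(text = sample_rows, stringsAsFactors = FALSE)
str(input)
```

Setting stringsAsFactors = FALSE only matters on R versions before 4.0, where strings were converted to factors by default.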

Then we can extract the organism number from a string such as org4|gene759 using a regular expression, convert it to an integer, and store it in a third column of our input:

input[, 3] <- as.integer(gsub('^org(.+)\\|.*', '\\1', input[, 2]))

Our input data now looks like this:

> input
   V1            V2 V3
1   0  org4|gene759  4
2   1  org1|gene992  1
3   2 org1|gene1101  1
4   3  org4|gene757  4
5   4 org1|gene1702  1
6   5  org1|gene989  1
7   6  org1|gene990  1
8   7 org1|gene1699  1
9   9 org1|gene1102  1
10 10 org4|gene2439  4
11 10 org1|gene1374  1
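A quick way to see what the regular expression captures, and why it may be worth converting the result with as.integer before calling max on it. This is only a sketch: the third string, with a two-digit organism number, is invented for illustration and does not appear in the question.

```r
genes <- c("org4|gene759", "org1|gene992", "org12|gene5")  # last one is made up

# '^org(.+)\\|.*' keeps only what sits between 'org' and the '|'
org_chr <- gsub('^org(.+)\\|.*', '\\1', genes)
org_chr                    # still character strings

# As text, "4" sorts above "12"; as integers, 12 is the maximum
max(org_chr)
max(as.integer(org_chr))
```

With only single-digit organism numbers, as in the sample data, both forms give the same maximum.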

Then we need to list the possible values of org:

possibleOrgs <- seq_len(max(input[, 3])) # = c(1, 2, 3, 4)

Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.

result <- vapply(unique(input[, 1]), function (x) 
  possibleOrgs %in% input[input[, 1] == x, 3], logical(length(possibleOrgs)))
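To make the step less opaque, here is a small sketch that performs one iteration by hand for cluster 10 and then runs the full vapply call. It uses a cut-down copy of the table (four rows only) and the intermediate name orgsInCluster, which is mine.

```r
input <- data.frame(V1 = c(0, 1, 10, 10),
                    V2 = c("org4|gene759", "org1|gene992",
                           "org4|gene2439", "org1|gene1374"),
                    V3 = c(4L, 1L, 4L, 1L),   # organism number
                    stringsAsFactors = FALSE)
possibleOrgs <- 1:4

# One iteration by hand: which organisms occur in the rows of cluster 10?
orgsInCluster <- input[input[, 1] == 10, 3]
possibleOrgs %in% orgsInCluster   # one TRUE/FALSE per possible org

# vapply repeats this for every unique cluster; each call must return a
# logical vector with one entry per possible org
result <- vapply(unique(input[, 1]), function(x)
  possibleOrgs %in% input[input[, 1] == x, 3],
  logical(length(possibleOrgs)))
dim(result)   # orgs in rows, clusters in columns
```

Because each call returns a vector of organisms, the clusters end up as columns, which is why the result needs transposing before it matches the layout asked for.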

We can then format this result as we like, perhaps using t to transform its orientation, * 1 to convert from TRUEs and FALSEs to 1s and 0s, and colnames to title its columns:

result <- t(result) * 1
colnames(result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])

I hope that this is what you were looking for -- it wasn't quite clear from your question!

Output:

> result

   org1 org2 org3 org4
0     0    0    0    1
1     1    0    0    0
2     1    0    0    0
3     0    0    0    1
4     1    0    0    0
5     1    0    0    0
6     1    0    0    0
7     1    0    0    0
9     1    0    0    0
10    1    0    0    1

Upvotes: 1

akrun

Reputation: 887058

We could use table

out <- cbind(ClusterNo = seq_len(nrow(df1)), as.data.frame.matrix(table(seq_len(nrow(df1)), 
       factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))

head(out, 2)
#    ClusterNo org1 org2 org3 org4
#1         1    0    0    0    1
#2         2    1    0    0    0
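The one-liner is dense, so here is the same idea split into steps. This is a sketch that assumes df1 is the table read from the file, with the cluster number in column 1 and the org|gene string in column 2; only the first three rows of the question are used, and the names org and counts are mine.

```r
df1 <- data.frame(V1 = c(0, 1, 2),
                  V2 = c("org4|gene759", "org1|gene992", "org1|gene1101"),
                  stringsAsFactors = FALSE)

# Drop everything from the '|' onwards, leaving "org4", "org1", ...
org <- sub("\\|.*", "", df1[[2]])

# Fixing the levels keeps org2 and org3 as all-zero columns
org <- factor(org, levels = paste0("org", 1:4))

# One row per input line, one column per organism
counts <- as.data.frame.matrix(table(seq_len(nrow(df1)), org))
out <- cbind(ClusterNo = seq_len(nrow(df1)), counts)
out
```

Note that ClusterNo here is simply the row number of the input, not the value in the first column.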

If the first column should define the clusters instead, we can tabulate on it to get the frequency of each organism per cluster:

out1 <- as.data.frame.matrix(table(df1[[1]], 
    factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
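One thing to keep in mind with this variant is that table counts occurrences, so a cluster holding two genes from the same organism gets a 2 rather than a 1. The sketch below shows this and one possible way to reduce the counts to presence/absence. The row with gene9999 is invented for illustration, and the name presence is mine.

```r
df1 <- data.frame(V1 = c(9, 10, 10, 10),
                  V2 = c("org1|gene1102", "org4|gene2439", "org1|gene1374",
                         "org1|gene9999"),   # last row is invented
                  stringsAsFactors = FALSE)

org <- factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))
out1 <- as.data.frame.matrix(table(df1[[1]], org))
out1["10", ]              # org1 is counted twice in cluster 10

# Collapse counts to 0/1 if only presence or absence matters
presence <- (out1 > 0) * 1
presence["10", ]
```

Whether counts or 0/1 flags are the right output depends on what the question intended, which the sample does not settle.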

Upvotes: 2
