Lou_A
Lou_A

Reputation: 259

Extract top 10 values by group with groups on the y axis of a table

This question is very similar to Select the top N values by group but with a variation I can't seem to solve. I have a large frequency table with 3 columns and over 50 rows. Each row is independent and may the same value for a particular column. Example:

       CD     Freq    sample
 express 1    0.2        A
 express 2    0.7        A
 express 2    0.5        B
 express 3    0.1        B
 express 4    0.9        A
 express 4    0.3        C
 express 5    0.8        B
 express 5    0.1        C

I'd like to generate a "top 10" frequency table for each sample where the rows are consolidated by each CD, and the columns are each sample (A, B, C etc), and in the intersection the frequency. Example:

             A    B    C
express 1   0.2   0    0
express 2   0.7  0.5   0
express 3   0    0.1   0
express 4   0.3   0    0.9
express 5   0     0.8  0.1

I've generated the original table and can sort, but I don't know how to aggregate the frequencies and list them in the column by sample. But that's as far as I can wrap my newbie skills around. Any help is greatly appreciated.

data <- vdj.data %>% map_dfr(read_delim, "\t", escape_double = FALSE, trim_ws = TRUE, .id ="source") %>% mutate(source = str_replace(basename(source), ".txt", ""))
data <- vdj.data %>% select(freq, cd, "file"= source)


require(data.table)
d <- data.table(vdj.data, key="cd")

top <- d[, head(.SD, 10), by=cd]

Expected Output should display only the top ten express by frequency in each sample, so it may require that some express may have to be consolidated, for example, say I wanted to pick the top 2 instead of top 10. I would produce the table:

                        A       B       C
            express 2   0.7    0.5     0
            express 4   0.9     0       0.3
            express 5   0       0.8     0.1

So as you can see, express 1 and express 3 are dropped because no sample has it as its top 2 frequencies. The top 2 for A are express 2 and express 4, the top 2 for B are express 2 and express 5, and finally, the top 2 for sample C are expression 4 and 5.

I hope this helps clarify.

Upvotes: 2

Views: 130

Answers (2)

Cole
Cole

Reputation: 11255

Since you included the library, here's a similar route:

library(data.table)

setorder(dt, -Freq)

n = 2
dt[dt[, .I[1:n], by = sample]$V1,
   xtabs(Freq ~ CD + sample)]
#>           sample
#> CD           A   B   C
#>   express2 0.7 0.5 0.0
#>   express4 0.9 0.0 0.3
#>   express5 0.0 0.8 0.1

#or

dcast(data = dt[dt[, .I[1:n], by = sample]$V1],
      CD ~ sample,
      value.var = 'Freq',
      fill = 0L)
#>          CD   A   B   C
#> 1: express2 0.7 0.5 0.0
#> 2: express4 0.9 0.0 0.3
#> 3: express5 0.0 0.8 0.1

Created on 2019-09-18 by the reprex package (v0.3.0)

Data:

dt <- fread(
'CD     Freq    sample
express1    0.2        A
express2    0.7        A
express2    0.5        B
express3    0.1        B
express4    0.9        A
express4    0.3        C
express5    0.8        B
express5    0.1        C')

Upvotes: 2

Gregor Thomas
Gregor Thomas

Reputation: 145765

Using dplyr and reshape2, we get the top N by group and then convert long to wide. Feel free to use your favorite long to wide method from the FAQ

n = 2
df %>% group_by(sample) %>%
  arrange(desc(Freq)) %>% 
  slice(1:n) %>% 
  reshape2::dcast(CD ~ sample, value.var = "Freq", fill = 0)
#          CD   A   B   C
# 1 express 2 0.7 0.5 0.0
# 2 express 4 0.9 0.0 0.3
# 3 express 5 0.0 0.8 0.1

Keeping top n anywhere:

df %>% group_by(sample) %>%
  arrange(desc(Freq)) %>%
  slice(1:n) %>%
  ungroup %>%
  select(CD) %>% 
  inner_join(df) %>%
  reshape2::dcast(CD ~ sample, value.var = "Freq", fill = 0,
                  fun.aggregate = mean)

Using this data:

df = read.table(text = "       CD     Freq    sample
 'express 1'  0.2        A
 'express 2'    0.7        A
 'express 2'  0.5        B
 'express 3'    0.1        B
 'express 4'  0.9        A
 'express 4'    0.3        C
 'express 5'  0.8        B
 'express 5'    0.1        C", header = T)

Upvotes: 2

Related Questions