Reputation: 259
This question is very similar to Select the top N values by group, but with a variation I can't seem to solve. I have a large frequency table with 3 columns and over 50 rows. Each row is independent, and different rows may share the same value in a particular column. Example:
CD Freq sample
express 1 0.2 A
express 2 0.7 A
express 2 0.5 B
express 3 0.1 B
express 4 0.9 A
express 4 0.3 C
express 5 0.8 B
express 5 0.1 C
I'd like to generate a "top 10" frequency table where the rows are consolidated by CD, the columns are the samples (A, B, C, etc.), and each cell holds the frequency. Example:
A B C
express 1 0.2 0 0
express 2 0.7 0.5 0
express 3 0 0.1 0
express 4 0.9 0 0.3
express 5 0 0.8 0.1
I've generated the original table and can sort it, but I don't know how to aggregate the frequencies and spread them into one column per sample. That's as far as my newbie skills will stretch. Any help is greatly appreciated.
# Read every file into one data frame, tagging each row with its source file
library(tidyverse)
data <- vdj.data %>%
  map_dfr(read_delim, "\t", escape_double = FALSE, trim_ws = TRUE, .id = "source") %>%
  mutate(source = str_replace(basename(source), ".txt", ""))
data <- data %>% select(freq, cd, "file" = source)
# My attempt so far: this takes the first 10 rows per cd, not the top 10 per sample
require(data.table)
d <- data.table(data, key = "cd")
top <- d[, head(.SD, 10), by = cd]
The expected output should display only the top ten express values by frequency in each sample, with rows for the same express consolidated across samples. For example, say I wanted to pick the top 2 instead of the top 10; I would produce this table:
A B C
express 2 0.7 0.5 0
express 4 0.9 0 0.3
express 5 0 0.8 0.1
So as you can see, express 1 and express 3 are dropped because no sample has them among its top 2 frequencies. The top 2 for A are express 2 and express 4, the top 2 for B are express 2 and express 5, and the top 2 for sample C are express 4 and express 5.
I hope this helps clarify.
Upvotes: 2
Views: 130
Reputation: 11255
Since you included the data.table library, here's a similar route:
library(data.table)
setorder(dt, -Freq)
n = 2
dt[dt[, .I[1:n], by = sample]$V1,
   xtabs(Freq ~ CD + sample)]
#> sample
#> CD A B C
#> express2 0.7 0.5 0.0
#> express4 0.9 0.0 0.3
#> express5 0.0 0.8 0.1
# or
dcast(data = dt[dt[, .I[1:n], by = sample]$V1],
      CD ~ sample,
      value.var = 'Freq',
      fill = 0L)
#> CD A B C
#> 1: express2 0.7 0.5 0.0
#> 2: express4 0.9 0.0 0.3
#> 3: express5 0.0 0.8 0.1
Created on 2019-09-18 by the reprex package (v0.3.0)
Data:
dt <- fread(
'CD Freq sample
express1 0.2 A
express2 0.7 A
express2 0.5 B
express3 0.1 B
express4 0.9 A
express4 0.3 C
express5 0.8 B
express5 0.1 C')
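One caveat: if any sample had fewer than n rows, .I[1:n] would pad with NA row indices. A safer variant (just a sketch, assuming the same sorted dt and n as above) uses head(.I, n) instead:
# After setorder(dt, -Freq), head(.I, n) stops at the group size rather than
# producing NA indices for samples with fewer than n rows
top_idx <- dt[, head(.I, n), by = sample]$V1
dcast(dt[top_idx], CD ~ sample, value.var = 'Freq', fill = 0L)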
Upvotes: 2
Reputation: 145765
Using dplyr and reshape2, we get the top N by group and then convert long to wide. Feel free to use your favorite long-to-wide method from the FAQ.
library(dplyr)
n = 2
df %>% group_by(sample) %>%
  arrange(desc(Freq)) %>%
  slice(1:n) %>%
  reshape2::dcast(CD ~ sample, value.var = "Freq", fill = 0)
# CD A B C
# 1 express 2 0.7 0.5 0.0
# 2 express 4 0.9 0.0 0.3
# 3 express 5 0.0 0.8 0.1
Keeping all of a CD's frequencies if it makes the top n in any sample:
df %>% group_by(sample) %>%
  arrange(desc(Freq)) %>%
  slice(1:n) %>%
  ungroup %>%
  select(CD) %>%
  inner_join(df) %>%
  reshape2::dcast(CD ~ sample, value.var = "Freq", fill = 0,
                  fun.aggregate = mean)
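As an aside (not required for the above), current tidyverse releases can express the same top-n-then-widen idea without reshape2. A sketch, assuming dplyr >= 1.0 for slice_max() and tidyr >= 1.1 for a scalar values_fill:
library(dplyr)
library(tidyr)

df %>%
  group_by(sample) %>%
  slice_max(Freq, n = 2, with_ties = FALSE) %>%  # top 2 rows per sample by Freq
  ungroup() %>%
  pivot_wider(names_from = sample, values_from = Freq, values_fill = 0)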
Using this data:
df = read.table(text = " CD Freq sample
'express 1' 0.2 A
'express 2' 0.7 A
'express 2' 0.5 B
'express 3' 0.1 B
'express 4' 0.9 A
'express 4' 0.3 C
'express 5' 0.8 B
'express 5' 0.1 C", header = T)
Upvotes: 2