Reputation: 259
This question is very similar to Select the top N values by group, but with a variation I can't seem to solve. I have a large frequency table with 3 columns and over 50 rows. Each row is independent, and different rows may share the same value in a particular column. Example:
CD Freq sample
express 1 0.2 A
express 2 0.7 A
express 2 0.5 B
express 3 0.1 B
express 4 0.9 A
express 4 0.3 C
express 5 0.8 B
express 5 0.1 C
I'd like to generate a "top 10" frequency table where the rows are consolidated by CD, the columns are the samples (A, B, C, etc.), and each cell holds the frequency. Example:
A B C
express 1 0.2 0 0
express 2 0.7 0.5 0
express 3 0 0.1 0
express 4 0.9 0 0.3
express 5 0 0.8 0.1
I've generated the original table and can sort it, but I don't know how to aggregate the frequencies and spread them into one column per sample. That's as far as my newbie skills will stretch. Any help is greatly appreciated.
# Read every file into one data frame, tagging each row with its source file
library(tidyverse)
data <- vdj.data %>%
  map_dfr(read_delim, "\t", escape_double = FALSE, trim_ws = TRUE, .id = "source") %>%
  mutate(source = str_replace(basename(source), ".txt", ""))
data <- data %>% select(freq, cd, "file" = source)
# My attempt so far: this takes the first 10 rows per cd, not the top 10 per sample
require(data.table)
d <- data.table(data, key = "cd")
top <- d[, head(.SD, 10), by = cd]
The expected output should display only the top ten express values by frequency in each sample, with rows for the same express consolidated across samples. For example, say I wanted to pick the top 2 instead of the top 10; I would produce this table:
A B C
express 2 0.7 0.5 0
express 4 0.9 0 0.3
express 5 0 0.8 0.1
So as you can see, express 1 and express 3 are dropped because no sample has them among its top 2 frequencies. The top 2 for A are express 2 and express 4, the top 2 for B are express 2 and express 5, and the top 2 for sample C are express 4 and express 5.
I hope this helps clarify.
Upvotes: 2
Views: 130
Reputation: 11255
Since you included the data.table library, here's a similar route:
library(data.table)
setorder(dt, -Freq)
n = 2
dt[dt[, .I[1:n], by = sample]$V1,
   xtabs(Freq ~ CD + sample)]
#> sample
#> CD A B C
#> express2 0.7 0.5 0.0
#> express4 0.9 0.0 0.3
#> express5 0.0 0.8 0.1
# or
dcast(data = dt[dt[, .I[1:n], by = sample]$V1],
      CD ~ sample,
      value.var = 'Freq',
      fill = 0L)
#> CD A B C
#> 1: express2 0.7 0.5 0.0
#> 2: express4 0.9 0.0 0.3
#> 3: express5 0.0 0.8 0.1
Created on 2019-09-18 by the reprex package (v0.3.0)
Data:
dt <- fread(
'CD Freq sample
express1 0.2 A
express2 0.7 A
express2 0.5 B
express3 0.1 B
express4 0.9 A
express4 0.3 C
express5 0.8 B
express5 0.1 C')
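One caveat: if any sample had fewer than n rows, .I[1:n] would pad with NA row indices. A safer variant (just a sketch, assuming the same sorted dt and n as above) uses head(.I, n) instead:
# After setorder(dt, -Freq), head(.I, n) stops at the group size rather than
# producing NA indices for samples with fewer than n rows
top_idx <- dt[, head(.I, n), by = sample]$V1
dcast(dt[top_idx], CD ~ sample, value.var = 'Freq', fill = 0L)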
Upvotes: 2
Reputation: 145765
Using dplyr and reshape2, we get the top N by group and then convert long to wide. Feel free to use your favorite long-to-wide method from the FAQ.
library(dplyr)
n = 2
df %>% group_by(sample) %>%
  arrange(desc(Freq)) %>%
  slice(1:n) %>%
  reshape2::dcast(CD ~ sample, value.var = "Freq", fill = 0)
# CD A B C
# 1 express 2 0.7 0.5 0.0
# 2 express 4 0.9 0.0 0.3
# 3 express 5 0.0 0.8 0.1
Keeping all of a CD's frequencies if it makes the top n in any sample:
df %>% group_by(sample) %>%
  arrange(desc(Freq)) %>%
  slice(1:n) %>%
  ungroup %>%
  select(CD) %>%
  inner_join(df) %>%
  reshape2::dcast(CD ~ sample, value.var = "Freq", fill = 0,
                  fun.aggregate = mean)
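As an aside (not required for the above), current tidyverse releases can express the same top-n-then-widen idea without reshape2. A sketch, assuming dplyr >= 1.0 for slice_max() and tidyr >= 1.1 for a scalar values_fill:
library(dplyr)
library(tidyr)

df %>%
  group_by(sample) %>%
  slice_max(Freq, n = 2, with_ties = FALSE) %>%  # top 2 rows per sample by Freq
  ungroup() %>%
  pivot_wider(names_from = sample, values_from = Freq, values_fill = 0)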
Using this data:
df = read.table(text = " CD Freq sample
'express 1' 0.2 A
'express 2' 0.7 A
'express 2' 0.5 B
'express 3' 0.1 B
'express 4' 0.9 A
'express 4' 0.3 C
'express 5' 0.8 B
'express 5' 0.1 C", header = T)
Upvotes: 2