MelaniaCB
MelaniaCB

Reputation: 435

Select top values per column in R

This might be a basic question but I haven't found an answear that fits my needs even though there are many alike.

I am trying to select the top 3 values from each column (keeping the row number as an id) but I haven't been able to find the right function for it.

I have a matrix like this from the beggining, using that code to add the id column

top_probs <- doc_topic_distr %>% 
  magrittr::set_rownames(seq_len(nrow(.))) %>% 
  as_tibble(rownames = "id") 

    id          V1          V2          V3          V4          V5          V6          V7          V8          V9          V10          V11          V12
1   1       0.000000000 0.000000000 0.133333333 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.181481481 0.685185185
2   2       0.950000000 0.000000000 0.050000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
3   3       0.028571429 0.114285714 0.814285714 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.014285714 0.028571429
4   4       0.000000000 0.000000000 0.000000000 0.002127660 0.240425532 0.136170213 0.408510638 0.076595745 0.000000000 0.000000000 0.000000000 0.136170213
5   5       0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.025000000 0.025000000 0.050000000 0.900000000
6   6       0.000000000 0.000000000 0.000000000 0.000000000 0.076923077 0.384615385 0.000000000 0.000000000 0.000000000 0.000000000 0.284615385 0.253846154
7   7       0.000000000 0.000000000 0.347826087 0.000000000 0.000000000 0.000000000 0.243478261 0.000000000 0.026086957 0.000000000 0.143478261 0.239130435
8   8       0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.024000000 0.004000000 0.000000000 0.460000000 0.000000000 0.224000000 0.288000000
9   9       0.000000000 0.000000000 0.311111111 0.000000000 0.011111111 0.000000000 0.011111111 0.000000000 0.000000000 0.000000000 0.388888889 0.277777778
10  10      0.000000000 0.466666667 0.000000000 0.000000000 0.000000000 0.266666667 0.200000000 0.000000000 0.066666667 0.000000000 0.000000000 0.000000000
11  11      0.000000000 0.153333333 0.006666667 0.000000000 0.000000000 0.826666667 0.000000000 0.013333333 0.000000000 0.000000000 0.000000000 0.000000000
12  12      0.295833333 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.404166667 0.004166667 0.000000000 0.000000000 0.000000000 0.295833333
13  13      0.000000000 0.000000000 0.000000000 0.000000000 0.009090909 0.790909091 0.154545455 0.009090909 0.009090909 0.000000000 0.027272727 0.000000000
14  14      0.000000000 0.155555556 0.000000000 0.000000000 0.000000000 0.033333333 0.033333333 0.011111111 0.000000000 0.533333333 0.011111111 0.222222222
15  15      0.055555556 0.000000000 0.533333333 0.000000000 0.000000000 0.000000000 0.177777778 0.005555556 0.000000000 0.000000000 0.227777778 0.000000000
16  16      0.000000000 0.153333333 0.006666667 0.000000000 0.000000000 0.826666667 0.000000000 0.013333333 0.000000000 0.000000000 0.000000000 0.000000000
17  17      0.295833333 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.404166667 0.004166667 0.000000000 0.000000000 0.000000000 0.295833333
18  18      0.000000000 0.000000000 0.000000000 0.000000000 0.009090909 0.790909091 0.154545455 0.009090909 0.009090909 0.000000000 0.027272727 0.000000000
19  19      0.000000000 0.155555556 0.000000000 0.000000000 0.000000000 0.033333333 0.033333333 0.011111111 0.000000000 0.533333333 0.011111111 0.222222222
20  20      0.055555556 0.000000000 0.533333333 0.000000000 0.000000000 0.000000000 0.177777778 0.005555556 0.000000000 0.000000000 0.227777778 0.000000000

Now, I want to know if there is a way to use top_frac() based on every column, meaning like I want 20% of my data gathered by the same number of highest probability rows per columns. Like if the 20% of the whole data was a 120, then I would get a matrix merging the highest 10 probabilities for each column. It would be easy doing it based on a single column, but I don't know how to do it based proportionally on each one of them.

Upvotes: 0

Views: 509

Answers (1)

jeffverboon
jeffverboon

Reputation: 336

Following up from above, it would be something like:

df %>%
  gather(column, value, -id) %>%
  group_by(id, column) %>%
  top_n(3)

Upvotes: 2

Related Questions