Kimster
Kimster

Reputation: 23

R - Extracting rows of max/min values in a dataframe containing strings, NA and groups

I want to find a way to extract the n rows that contain the Top results (min and max) in a dataframe. The problem is that this dataframe contains strings and NA and also groups. Also if the top results are in the same row, I still need exactly n rows, so being in the same row counts just as 1 result.

  V01_Code V01_Corr  V01_Lag  V02_Code V02_Corr V02_Lag V03_Code V03_Corr V03_Lag V04_Code V04_Corr V04_Lag Group
1      AMI     0.63      L7     <NA>       NA    <NA>     <NA>       NA    <NA>     <NA>       NA    <NA>     B
2      CII    -0.61      L7      CMI    -0.53      L7     <NA>       NA    <NA>     <NA>       NA    <NA>     A
3      AFI     0.51      L7     <NA>       NA    <NA>     <NA>       NA    <NA>     <NA>       NA    <NA>     A
4      AII     0.52      L7      BII     0.62      L4      BMI     0.60      L7      III     0.58      L4     B
5      BII     0.52      L7      IIA     0.74      L6      III     0.51      L7      IMA     0.75      L6     A
6      AII     0.58   L6/L7      BII     0.69      L4      BMI     0.70      L7      IIA     0.57      L4     A
7      IIA     0.58      L6      IMA     0.59      L6      IMI     0.52      L6     <NA>       NA    <NA>     B
8      IMU     0.52      L6     <NA>       NA    <NA>     <NA>       NA    <NA>     <NA>       NA    <NA>     A

I tried several versions like this:

aggregate(. ~ Group, df, function(x) max(head(sort(x),2),na.rm=T))

But it doesnt seem to work! As output I want a dataframe of the rows (for example 2 rows here) that contain the highest and lowest values. So in this case 0.75 in row 5 is the highest value, 2nd highest is in the same row which doesnt count then. The 2nd highest in any other row would be 0.7 in row 6. So for my top 2 result of max values I want:

   V01_Code V01_Corr  V01_Lag  V02_Code V02_Corr V02_Lag V03_Code V03_Corr V03_Lag V04_Code V04_Corr V04_Lag Group
5      BII     0.52      L7      IIA     0.74      L6      III     0.51      L7      IMA     0.75      L6     A
6      AII     0.58   L6/L7      BII     0.69      L4      BMI     0.70      L7      IIA     0.57      L4     A
1      AMI     0.63      L7     <NA>       NA    <NA>     <NA>       NA    <NA>     <NA>       NA    <NA>     B
4      AII     0.52      L7      BII     0.62      L4      BMI     0.60      L7      III     0.58      L4     B

n in this case would be 2, so the 2 rows that contain the maximum values for each group.

Here is my dataframe

structure(list(V01_Code = c("AMI", "CII", "AFI", "AII", "BII", 
"AII", "IIA", "IMU"), V01_Corr = c(0.63, -0.61, 0.51, 0.52, 0.52, 
0.58, 0.58, 0.52), V01_Lag = c("L7", "L7", "L7", "L7", "L7", 
"L6/L7", "L6", "L6"), V02_Code = c(NA, "CMI", NA, "BII", "IIA", 
"BII", "IMA", NA), V02_Corr = c(NA, -0.53, NA, 0.62, 0.74, 0.69, 
0.59, NA), V02_Lag = c(NA, "L7", NA, "L4", "L6", "L4", "L6", 
NA), V03_Code = c(NA, NA, NA, "BMI", "III", "BMI", "IMI", NA), 
    V03_Corr = c(NA, NA, NA, 0.6, 0.51, 0.7, 0.52, NA), V03_Lag = c(NA, 
    NA, NA, "L7", "L7", "L7", "L6", NA), V04_Code = c(NA, NA, 
    NA, "III", "IMA", "IIA", NA, NA), V04_Corr = c(NA, NA, NA, 
    0.58, 0.75, 0.57, NA, NA), V04_Lag = c(NA, NA, NA, "L4", 
    "L6", "L4", NA, NA), Group = c("B", "A", "A", "B", "A", "A", 
    "B", "A")), row.names = c("1", "2", "3", "4", "5", "6", 
"7", "8"), class = "data.frame")

Upvotes: 0

Views: 36

Answers (1)

akrun
akrun

Reputation: 887851

Here is an option with reshaping i.e. create a row sequence (row_number) column, reshape from wide to long with pivot_longer, arrange the rows by 'Group' and the 'value' column in descending order, then filter the first 'n' unique 'rn' - row_number column, ungroup and reshape back to 'wide' format with pivot_wider

library(dplyr)
library(tidyr)
df1 %>%
     mutate(rn = row_number()) %>% 
     pivot_longer(cols = ends_with("Corr"), names_to = 'Corr') %>% 
     arrange(Group, desc(value)) %>%
     group_by(Group) %>% 
     filter(rn %in% head(unique(rn), 2)) %>% 
     ungroup %>% 
     select(-rn) %>% 
     pivot_wider(names_from = Corr, values_from = value)

-output

# A tibble: 4 x 13
  V01_Code V01_Lag V02_Code V02_Lag V03_Code V03_Lag V04_Code V04_Lag Group V04_Corr V02_Corr V03_Corr V01_Corr
  <chr>    <chr>   <chr>    <chr>   <chr>    <chr>   <chr>    <chr>   <chr>    <dbl>    <dbl>    <dbl>    <dbl>
1 BII      L7      IIA      L6      III      L7      IMA      L6      A         0.75     0.74     0.51     0.52
2 AII      L6/L7   BII      L4      BMI      L7      IIA      L4      A         0.57     0.69     0.7      0.58
3 AMI      L7      <NA>     <NA>    <NA>     <NA>    <NA>     <NA>    B        NA       NA       NA        0.63
4 AII      L7      BII      L4      BMI      L7      III      L4      B         0.58     0.62     0.6      0.52

Upvotes: 2

Related Questions