Reputation: 23
I want to find a way to extract the n rows that contain the Top results (min and max) in a dataframe. The problem is that this dataframe contains strings and NA and also groups. Also if the top results are in the same row, I still need exactly n rows, so being in the same row counts just as 1 result.
V01_Code V01_Corr V01_Lag V02_Code V02_Corr V02_Lag V03_Code V03_Corr V03_Lag V04_Code V04_Corr V04_Lag Group
1 AMI 0.63 L7 <NA> NA <NA> <NA> NA <NA> <NA> NA <NA> B
2 CII -0.61 L7 CMI -0.53 L7 <NA> NA <NA> <NA> NA <NA> A
3 AFI 0.51 L7 <NA> NA <NA> <NA> NA <NA> <NA> NA <NA> A
4 AII 0.52 L7 BII 0.62 L4 BMI 0.60 L7 III 0.58 L4 B
5 BII 0.52 L7 IIA 0.74 L6 III 0.51 L7 IMA 0.75 L6 A
6 AII 0.58 L6/L7 BII 0.69 L4 BMI 0.70 L7 IIA 0.57 L4 A
7 IIA 0.58 L6 IMA 0.59 L6 IMI 0.52 L6 <NA> NA <NA> B
8 IMU 0.52 L6 <NA> NA <NA> <NA> NA <NA> <NA> NA <NA> A
I tried several versions like this:
aggregate(. ~ Group, df, function(x) max(head(sort(x),2),na.rm=T))
But it doesnt seem to work! As output I want a dataframe of the rows (for example 2 rows here) that contain the highest and lowest values. So in this case 0.75 in row 5 is the highest value, 2nd highest is in the same row which doesnt count then. The 2nd highest in any other row would be 0.7 in row 6. So for my top 2 result of max values I want:
V01_Code V01_Corr V01_Lag V02_Code V02_Corr V02_Lag V03_Code V03_Corr V03_Lag V04_Code V04_Corr V04_Lag Group
5 BII 0.52 L7 IIA 0.74 L6 III 0.51 L7 IMA 0.75 L6 A
6 AII 0.58 L6/L7 BII 0.69 L4 BMI 0.70 L7 IIA 0.57 L4 A
1 AMI 0.63 L7 <NA> NA <NA> <NA> NA <NA> <NA> NA <NA> B
4 AII 0.52 L7 BII 0.62 L4 BMI 0.60 L7 III 0.58 L4 B
n in this case would be 2, so the 2 rows that contain the maximum values for each group.
Here is my dataframe
structure(list(V01_Code = c("AMI", "CII", "AFI", "AII", "BII",
"AII", "IIA", "IMU"), V01_Corr = c(0.63, -0.61, 0.51, 0.52, 0.52,
0.58, 0.58, 0.52), V01_Lag = c("L7", "L7", "L7", "L7", "L7",
"L6/L7", "L6", "L6"), V02_Code = c(NA, "CMI", NA, "BII", "IIA",
"BII", "IMA", NA), V02_Corr = c(NA, -0.53, NA, 0.62, 0.74, 0.69,
0.59, NA), V02_Lag = c(NA, "L7", NA, "L4", "L6", "L4", "L6",
NA), V03_Code = c(NA, NA, NA, "BMI", "III", "BMI", "IMI", NA),
V03_Corr = c(NA, NA, NA, 0.6, 0.51, 0.7, 0.52, NA), V03_Lag = c(NA,
NA, NA, "L7", "L7", "L7", "L6", NA), V04_Code = c(NA, NA,
NA, "III", "IMA", "IIA", NA, NA), V04_Corr = c(NA, NA, NA,
0.58, 0.75, 0.57, NA, NA), V04_Lag = c(NA, NA, NA, "L4",
"L6", "L4", NA, NA), Group = c("B", "A", "A", "B", "A", "A",
"B", "A")), row.names = c("1", "2", "3", "4", "5", "6",
"7", "8"), class = "data.frame")
Upvotes: 0
Views: 36
Reputation: 887851
Here is an option with reshaping i.e. create a row sequence (row_number
) column, reshape from wide to long with pivot_longer
, arrange the rows by 'Group' and the 'value' column in desc
ending order, then filter
the first 'n' unique
'rn' - row_number column, ungroup
and reshape back to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df1 %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = ends_with("Corr"), names_to = 'Corr') %>%
arrange(Group, desc(value)) %>%
group_by(Group) %>%
filter(rn %in% head(unique(rn), 2)) %>%
ungroup %>%
select(-rn) %>%
pivot_wider(names_from = Corr, values_from = value)
-output
# A tibble: 4 x 13
V01_Code V01_Lag V02_Code V02_Lag V03_Code V03_Lag V04_Code V04_Lag Group V04_Corr V02_Corr V03_Corr V01_Corr
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 BII L7 IIA L6 III L7 IMA L6 A 0.75 0.74 0.51 0.52
2 AII L6/L7 BII L4 BMI L7 IIA L4 A 0.57 0.69 0.7 0.58
3 AMI L7 <NA> <NA> <NA> <NA> <NA> <NA> B NA NA NA 0.63
4 AII L7 BII L4 BMI L7 III L4 B 0.58 0.62 0.6 0.52
Upvotes: 2