R dplyr purrr find index value of column minimum across multiple columns and corresponding row value at index

Question

I have a dataset of air pollution values each which corresponds to a particular station. I've computed the number of nulls for each air pollution variable and station code. Now I want to find the station for each pollutant with the minimum number of nulls.

Here is the dataset:

nulls_by_code
# A tibble: 34 x 9
   code    nox   no2    o3   so2    co pm10_raw  pm10  pm25
              
 1 BL0    7195  1386  1234  4933 24408    24408  2981  2092
 2 BQ7     240   240   229 24392 24392      310   326 24392
 3 BQ8   24296 24296 24296 24296 24296    24296   323 24296
 4 CR8   24395 24395 24395 24395 24395    24395 24395   733
 5 CT3    1055  1055 24393 24393 24393     2365   971  2888
 6 EI3    1994  1994 24283 24283 24283    24283  3801 24283
 7 EN7     271   271 24392 24392 24392    24392 24392 24392
 8 HG4     205   205  1048 24392 24392    24392 24392 24392
 9 HP1   24391  8610  9596 24391 24391    24391  7987  8255
10 HP3   15633 15633 15633 15633 15633    15633 15633 15633
# … with 24 more rows

I've read a few posts about similar problems, but they're not quite what I want. There is no grouping variable in my data and I want to compute the minimum and the corresponding code for every column, something like:

nulls_by_code %>% 
     summarise(across(nox:pm25, ~ slice(which.min(.)))) %>% 
     mutate(across(nox:pm25, code)) ## I know this won't work, not sure which verb to use!

I've looked at: Find Minimum of a column and corresponding row of the minimum with condition on another column and minimum (or maximum) value of each row across multiple columns and How to select the row with the maximum value in each group but none of these have got me to dplyr dataframe nirvana.

The result should look something like this, or a transpose version of to ensure that columns have the same datatype and don't mix and .


+-----------+-----+-----+------+------+
| row_name  | nox | no2 | pm10 | pm25 |
+-----------+-----+-----+------+------+
| min_value | 205 | 205 | 323  | 733  |
| min_code  | HG4 | HG4 | BQ8  | CR8  |
+-----------+-----+-----+------+------+

I suspect dplyr grandmaster @akrun will be able to fix this in 10 seconds ... :-) Thanks for any help you are able to give.

stefan · Accepted Answer

Converting to long format via tidyr::pivot_longer and making use of dplyr::top_n this could be achieved like so:

As you want the minimum values for each pollutant we first group by pollutant and get the one row cointaining the minimum value per group by making use of top_n(1, -value) where value is a default name assigned by pivot_longer.

nulls_by_code <- read.table(text = "code    nox   no2    o3   so2    co pm10_raw  pm10  pm25
  1 BL0    7195  1386  1234  4933 24408    24408  2981  2092
2 BQ7     240   240   229 24392 24392      310   326 24392
3 BQ8   24296 24296 24296 24296 24296    24296   323 24296
4 CR8   24395 24395 24395 24395 24395    24395 24395   733
5 CT3    1055  1055 24393 24393 24393     2365   971  2888
6 EI3    1994  1994 24283 24283 24283    24283  3801 24283
7 EN7     271   271 24392 24392 24392    24392 24392 24392
8 HG4     205   205  1048 24392 24392    24392 24392 24392
9 HP1   24391  8610  9596 24391 24391    24391  7987  8255
10 HP3   15633 15633 15633 15633 15633    15633 15633 15633", header = TRUE)

library(dplyr)
library(tidyr)

nulls_by_code %>% 
  pivot_longer(-code, names_to = "pollutant") %>% 
  group_by(pollutant) %>% 
  top_n(1, -value)
#> # A tibble: 8 x 3
#> # Groups:   pollutant [8]
#>   code  pollutant value
#>         
#> 1 BL0   so2        4933
#> 2 BQ7   o3          229
#> 3 BQ7   pm10_raw    310
#> 4 BQ8   pm10        323
#> 5 CR8   pm25        733
#> 6 HG4   nox         205
#> 7 HG4   no2         205
#> 8 HP3   co        15633

R dplyr purrr find index value of column minimum across multiple columns and corresponding row value at index

Answers (1)

Related Questions