Reputation: 799
I have a dataset of air pollution values each which corresponds to a particular station. I've computed the number of nulls for each air pollution variable and station code. Now I want to find the station for each pollutant with the minimum number of nulls.
Here is the dataset:
nulls_by_code
# A tibble: 34 x 9
code nox no2 o3 so2 co pm10_raw pm10 pm25
<chr> <int> <int> <int> <int> <int> <int> <int> <int>
1 BL0 7195 1386 1234 4933 24408 24408 2981 2092
2 BQ7 240 240 229 24392 24392 310 326 24392
3 BQ8 24296 24296 24296 24296 24296 24296 323 24296
4 CR8 24395 24395 24395 24395 24395 24395 24395 733
5 CT3 1055 1055 24393 24393 24393 2365 971 2888
6 EI3 1994 1994 24283 24283 24283 24283 3801 24283
7 EN7 271 271 24392 24392 24392 24392 24392 24392
8 HG4 205 205 1048 24392 24392 24392 24392 24392
9 HP1 24391 8610 9596 24391 24391 24391 7987 8255
10 HP3 15633 15633 15633 15633 15633 15633 15633 15633
# … with 24 more rows
I've read a few posts about similar problems, but they're not quite what I want. There is no grouping variable in my data and I want to compute the minimum and the corresponding code for every column, something like:
nulls_by_code %>%
summarise(across(nox:pm25, ~ slice(which.min(.)))) %>%
mutate(across(nox:pm25, code)) ## I know this won't work, not sure which verb to use!
I've looked at: Find Minimum of a column and corresponding row of the minimum with condition on another column and minimum (or maximum) value of each row across multiple columns and How to select the row with the maximum value in each group but none of these have got me to dplyr dataframe nirvana.
The result should look something like this, or a transpose version of to ensure that columns have the same datatype and don't mix and .
+-----------+-----+-----+------+------+
| row_name | nox | no2 | pm10 | pm25 |
+-----------+-----+-----+------+------+
| min_value | 205 | 205 | 323 | 733 |
| min_code | HG4 | HG4 | BQ8 | CR8 |
+-----------+-----+-----+------+------+
I suspect dplyr grandmaster @akrun will be able to fix this in 10 seconds ... :-) Thanks for any help you are able to give.
Upvotes: 0
Views: 416
Reputation: 123783
Converting to long format via tidyr::pivot_longer
and making use of dplyr::top_n
this could be achieved like so:
As you want the minimum values for each pollutant we first group by pollutant
and get the one row cointaining the minimum value per group by making use of top_n(1, -value)
where value
is a default name assigned by pivot_longer
.
nulls_by_code <- read.table(text = "code nox no2 o3 so2 co pm10_raw pm10 pm25
1 BL0 7195 1386 1234 4933 24408 24408 2981 2092
2 BQ7 240 240 229 24392 24392 310 326 24392
3 BQ8 24296 24296 24296 24296 24296 24296 323 24296
4 CR8 24395 24395 24395 24395 24395 24395 24395 733
5 CT3 1055 1055 24393 24393 24393 2365 971 2888
6 EI3 1994 1994 24283 24283 24283 24283 3801 24283
7 EN7 271 271 24392 24392 24392 24392 24392 24392
8 HG4 205 205 1048 24392 24392 24392 24392 24392
9 HP1 24391 8610 9596 24391 24391 24391 7987 8255
10 HP3 15633 15633 15633 15633 15633 15633 15633 15633", header = TRUE)
library(dplyr)
library(tidyr)
nulls_by_code %>%
pivot_longer(-code, names_to = "pollutant") %>%
group_by(pollutant) %>%
top_n(1, -value)
#> # A tibble: 8 x 3
#> # Groups: pollutant [8]
#> code pollutant value
#> <chr> <chr> <int>
#> 1 BL0 so2 4933
#> 2 BQ7 o3 229
#> 3 BQ7 pm10_raw 310
#> 4 BQ8 pm10 323
#> 5 CR8 pm25 733
#> 6 HG4 nox 205
#> 7 HG4 no2 205
#> 8 HP3 co 15633
Upvotes: 1