Agile Bean

Reputation: 7161

Detect number of missing values per dataframe column with dplyr & purrr

Take the built-in R dataset airquality, a simple data frame, and check its missing values:

airquality %>% summary

While this works:

airquality %>% map_df(is.na) %>% map_df(sum)

  Ozone Solar.R  Wind  Temp Month   Day
  <int>   <int> <int> <int> <int> <int>
1    37       7     0     0     0     0

and this, using purrr's formula syntax, works too:

airquality %>% map_df(~sum(is.na(.)))

  Ozone Solar.R  Wind  Temp Month   Day
  <int>   <int> <int> <int> <int> <int>
1    37       7     0     0     0     0

this doesn't work as expected:

airquality %>% map_df(sum(is.na(.)))

  Ozone Solar.R  Wind  Temp Month   Day
  <int>   <int> <dbl> <int> <int> <int>
1    23     148     8    82     6    13

My question is: How can you explain the last result?

Where exactly does the calculation happen - in dplyr or purrr?

Upvotes: 1

Views: 827

Answers (1)

Aurèle

Reputation: 12839

The behavior of the various syntaxes around %>% is explained in detail in help("%>%", package = "magrittr").

In this specific instance, sum(is.na(.)) isn't interpreted as an anonymous function, as the OP seems to expect, so . isn't the argument to an anonymous function.

Instead, . is the LHS (left hand side) of the pipe.

airquality %>% map_df(sum(is.na(.))) could be unfolded as map_df(airquality, .f = sum(is.na(airquality))).
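A minimal sketch to check that unfolding, assuming purrr (which re-exports %>%) and dplyr (needed by map_df) are installed. The variable names piped, unfolded, and counted are only illustrative:

```r
library(purrr)  # provides map_df() and re-exports the magrittr pipe

# The dot sits only inside a nested call, so the pipe still inserts the
# LHS as the first argument AND substitutes it for the dot.
piped    <- airquality %>% map_df(sum(is.na(.)))
unfolded <- map_df(airquality, sum(is.na(airquality)))

# Both calls should produce the same one-row result.
all.equal(piped, unfolded)

# What the OP intended: a formula (or an explicit function), so that the
# dot refers to each column in turn rather than to the whole data frame.
counted <- airquality %>% map_df(~ sum(is.na(.)))
counted
```

The leading ~ is what turns the expression into a purrr lambda; without it, the expression is evaluated once, before map_df ever sees it.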

sum(is.na(airquality)) evaluates to 44, and according to help("map_df"), if the .f argument to map_df is a numeric vector, "it is converted to an extractor function".

Summing up: this extracts the 44th element of each column and binds the results back into a data frame. Or, with some oversimplification, this extracts the 44th row. That is also why Wind shows up as <dbl>: the values are the original column entries, not counts.
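To see the extractor behaviour directly, here is a small sketch under the same assumptions (purrr and dplyr installed). The names n_missing, extracted, and row_44 are made up for illustration:

```r
library(purrr)

# Total number of NAs in the whole data frame (37 in Ozone + 7 in Solar.R).
n_missing <- sum(is.na(airquality))
n_missing

# A bare number as .f is turned into an extractor, so each column
# contributes its n_missing-th element.
extracted <- map_df(airquality, n_missing)

# Base R equivalent: the same row picked by position.
row_44 <- airquality[n_missing, ]

# Compare the values only; the row names differ (1 vs. 44).
all.equal(unlist(extracted), unlist(row_44))
```

Comparing with unlist() sidesteps the difference in row names and classes between the tibble returned by map_df and the base data frame row.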

Upvotes: 2
