Tox
Tox

Reputation: 854

Count number of observations without N/A per year in R

I have a dataset and I want to summarize the number of observations without the missing values (denoted by NA).

My data is similar as the following:

data <- read.table(header = TRUE, 
               stringsAsFactors = FALSE, 
               text="CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
               1 2.5 2000 1 2
               1 4 2001 3 1
               1 3 2002 NA 7
               2 1 2000 3 NA
               2 2.4 2001 0 4
               2 6 2002 2 9
               3 10 2000 NA 3")

I was planning to use the package dplyr, but that does only take the years into account and not the different variables:

library(dplyr)
data %>% 
  group_by(Year) %>%
  summarise(number = n())

How can I obtain the following outcome?

                    2000 2001 2002
ExplanatoryVariable1  2   2    1 
ExplanatoryVariable2  2   2    2

Upvotes: 2

Views: 3433

Answers (4)

Onyambu
Onyambu

Reputation: 79288

Using base R:

  do.call(cbind,by(data[3:5], data$Year,function(x) colSums(!is.na(x[-1]))))                       
                      2000 2001 2002
 ExplanatoryVariable1    2    2    1
 ExplanatoryVariable2    2    2    2

For aggregate:

 aggregate(.~Year,data[3:5],function(x) sum(!is.na(x)),na.action = function(x)x)

Upvotes: 4

talat
talat

Reputation: 70296

To get the counts, you can start by using:

library(dplyr)
data %>% 
  group_by(Year) %>% 
  summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.)))
## A tibble: 3 x 3
#   Year ExplanatoryVariable1 ExplanatoryVariable2
#  <int>                <int>                <int>
#1  2000                    2                    2
#2  2001                    2                    2
#3  2002                    1                    2

If you want to reshape it as shown in your question, you can extend the pipe using tidyr functions:

library(tidyr)
data %>% 
  group_by(Year) %>% 
  summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.))) %>% 
  gather(var, count, -Year) %>% 
  spread(Year, count)
## A tibble: 2 x 4
#                   var `2000` `2001` `2002`
#*                <chr>  <int>  <int>  <int>
#1 ExplanatoryVariable1      2      2      1
#2 ExplanatoryVariable2      2      2      2

Just to let OP know, since they have ~200 explanatory variables to select. You can use another option of summarise_at to select the variables. You can simply name the first:last variable, if they are ordered correctly in the data, for example:

data %>% 
  group_by(Year) %>%
  summarise_at(vars(ExplanatoryVariable1:ExplanatoryVariable2), ~sum(!is.na(.))) 

Or:

data %>% 
  group_by(Year) %>% 
  summarise_at(3:4, ~sum(!is.na(.))) 

Or store the variable names in a vector and use that:

vars <- names(data)[4:5]
data %>% 
  group_by(Year) %>% 
  summarise_at(vars, ~sum(!is.na(.))) 

Upvotes: 7

d.b
d.b

Reputation: 32548

You could do it with aggregate in base R.

aggregate(list(ExplanatoryVariable1 = data$ExplanatoryVariable1,
               ExplanatoryVariable2 = data$ExplanatoryVariable2),
          list(Year = data$Year),
          function(x) length(x[!is.na(x)]))
#  Year ExplanatoryVariable1 ExplanatoryVariable2
#1 2000                    2                    2
#2 2001                    2                    2
#3 2002                    1                    2

Upvotes: 1

cimentadaj
cimentadaj

Reputation: 1488

data %>%
  gather(cat, val, -(1:3)) %>%
  filter(complete.cases(.)) %>%
  group_by(Year, cat) %>%
  summarize(n = n()) %>%
  spread(Year, n)

# # A tibble: 2 x 4
#                    cat `2000` `2001` `2002`
# *                <chr>  <int>  <int>  <int>
# 1 ExplanatoryVariable1      2      2      1
# 2 ExplanatoryVariable2      2      2      2

Should do it. You start by making the data stacked, and the simply calculating the n for both year and each explanatory variable. If you want the data back to wide format, then use spread, but either way without spread, you get the counts for both variables.

Upvotes: 4

Related Questions