Konrad
Konrad

Reputation: 18585

Reporting missing values per subgroup in R from a data frame

I have a data frame resembling the one below.

Group Expenditure Date
A     56434       22 June 2014
B     54231       1 July 2013
B     1412        9 May 2011
A     NA           28 July 2009
A     NA           3 July 2009
C     98          2 July 1999
C     NA           14 July 2004

I'm interested in creating missing values reports for the Expenditure column. One value should return the number of missing values per column, this is solved with use of the following code

sapply(exp.dta, function(x) sum(is.na(x)))

In addition, I would like to report the number of missing values for each date. The date column is formatted as a proper date with use of the as.Date function. As for now, I'm not interested in reporting missing values per subgroup.

Upvotes: 1

Views: 768

Answers (3)

coffeinjunky
coffeinjunky

Reputation: 11514

Try this:

 library(plyr)
 ddply(your.data, .(Date), summarize, nNA = sum(is.na(Expenditure))

This splits up the data into subgroups by Date and applies the function sum(is.na()) to the Expenditure column for these subgroups.

For instance,

 df <- read.table(text="Group Expenditure Date
 A     56434       22June2014
 B     54231       1July2013
 B     1412        9May2011
 A     NA           28July2009
 A     NA           3July2009
 C     98          2July1999
 C     NA           14July2004 ", sep="", header=T)

 ddply(df, .(Date), summarize, nNA=sum(is.na(Expenditure)))

yields:

         Date nNA
 1 14July2004   1
 2  1July2013   0
 3 22June2014   0
 4 28July2009   1
 5  2July1999   0
 6  3July2009   1
 7   9May2011   0

There are also several base-solutions. Here some examples:

  • Using by

    by(df, df$Date, function(x) sum(is.na(x$Expenditure)))
    
  • Using tapply

    with(df, tapply(Expenditure, Date, function(x) sum(is.na(x))))
    
  • Using aggregate (hat tip to @user20650)

    aggregate(df$Expenditure, by=list(df$Date), FUN= function(x) sum(is.na(x)))
    

which all give the same result but with slightly different formats. Pick which one you like the most. For a more general treatment, this kind of problem is called "split-apply-combine", see e.g. here.

Upvotes: 3

Rich Scriven
Rich Scriven

Reputation: 99321

Following the code you already have written, you could add split to it,

dat <- read.table(h=T, text = "Group Expenditure Date
  A     56434       22-June-2014
  B     54231       1-July-2013
  B     1412        9-May-2011
  A     NA           28-July-2009
  A     NA           3-July-2009
  C     98          2-July-1999
  C     NA           14-July-2004")

> sapply(split(dat$Expenditure, dat$Group), function(x) sum(is.na(x)))
# A B C 
# 2 0 1 

or for each date,

> s <- split(dat$Expenditure, dat$Date)
> as.matrix(sapply(s, function(x) sum(is.na(x))))
#              [,1]
# 14-July-2004    1
# 1-July-2013     0
# 22-June-2014    0
# 28-July-2009    1
# 2-July-1999     0
# 3-July-2009     1
# 9-May-2011      0

Upvotes: 2

Kara Woo
Kara Woo

Reputation: 3615

Or, using dplyr:

library('dplyr')
summarize(group_by(df, Date), nNA = sum(is.na(Expenditure)))

Upvotes: 1

Related Questions