Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R in which I can subset my raw dataframe according to some specifications, and thereafter convert this subsetted dataframe into a proportion table.

Unfortunately, some of these subsets are empty because I have no data for certain specifications, so no proportion table can be calculated. What I would like to do is take the closest time step that yields a non-empty subset and use it in place of the empty one.

Here are some details about my dataframe and function:

My raw dataframe looks roughly as follows:

| year | quarter | area | time_comb | no_individuals | lenCls | age |
| 2005 | 1       | 24   | 2005.1.24 | 8              | 380    | 3   |
| 2005 | 2       | 24   | 2005.2.24 | 4              | 490    | 2   |
| 2005 | 1       | 24   | 2005.1.24 | 3              | 460    | 6   |
| 2005 | 1       | 21   | 2005.1.21 | 25             | 400    | 2   |
| 2005 | 2       | 24   | 2005.2.24 | 1              | 680    | 6   |
| 2005 | 2       | 21   | 2005.2.21 | 2              | 620    | 5   |
| 2005 | 3       | 21   | 2005.3.21 | NA             | NA     | NA  |
| 2005 | 1       | 21   | 2005.1.21 | 1              | 510    | 5   |
| 2005 | 1       | 24   | 2005.1.24 | 1              | 670    | 4   |
| 2006 | 1       | 22   | 2006.1.22 | 2              | 750    | 4   |
| 2006 | 4       | 24   | 2006.4.24 | 1              | 660    | 8   |
| 2006 | 2       | 24   | 2006.2.24 | 8              | 540    | 3   |
| 2006 | 2       | 24   | 2006.2.24 | 4              | 560    | 3   |
| 2006 | 1       | 22   | 2006.1.22 | 2              | 250    | 2   |
| 2006 | 3       | 22   | 2006.3.22 | 1              | 520    | 2   |
| 2006 | 2       | 24   | 2006.2.24 | 1              | 500    | 2   |
| 2006 | 2       | 22   | 2006.2.22 | NA             | NA     | NA  |
| 2006 | 2       | 21   | 2006.2.21 | 3              | 480    | 2   |
| 2006 | 1       | 24   | 2006.1.24 | 1              | 640    | 5   |
| 2007 | 4       | 21   | 2007.4.21 | 2              | 620    | 3   |
| 2007 | 2       | 21   | 2007.2.21 | 1              | 430    | 3   |
| 2007 | 4       | 22   | 2007.4.22 | 14             | 410    | 2   |
| 2007 | 1       | 24   | 2007.1.24 | NA             | NA     | NA  |
| 2007 | 2       | 24   | 2007.2.24 | NA             | NA     | NA  |
| 2007 | 3       | 24   | 2007.3.22 | NA             | NA     | NA  |
| 2007 | 4       | 24   | 2007.4.24 | NA             | NA     | NA  |
| 2007 | 3       | 21   | 2007.3.21 | 1              | 560    | 4   |
| 2007 | 1       | 21   | 2007.1.21 | 7              | 300    | 3   |
| 2007 | 3       | 23   | 2007.3.23 | 1              | 640    | 5   |

Here year, quarter and area refer to a particular time (Year & Quarter) and area for which X no. of individuals were measured (no_individuals). For example, the first row says that in the first quarter of 2005 in area 24 I had 8 individuals belonging to a length class (lenCls) of 380 mm and age=3. It is worth mentioning that a particular year, quarter and area combination can have several length classes and ages (and thus multiple rows)!

So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
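To make the target concrete, here is a minimal sketch of that calculation for a single combination, using the three measured rows for year=2006, quarter=2 and area=24 from the table above. The subset is typed in by hand only to keep the example self-contained.

```r
# Rows for year 2006, quarter 2, area 24, copied from the example table
sALK <- data.frame(
  no_individuals = c(8, 4, 1),
  lenCls = c(540, 560, 500),
  age = c(3, 3, 2)
)

# Repeat each row once per measured individual (8 + 4 + 1 = 13 rows)
dfexp <- sALK[rep(seq_len(nrow(sALK)), sALK$no_individuals), ]

# Counts with ages as rows and length classes as columns
raw <- t(table(dfexp$lenCls, dfexp$age))

# Proportion of each length class within an age (rows sum to 1)
key <- round(prop.table(raw, margin = 1), 3)
print(key)
```

Because of the t(), rows are ages and columns are length classes, so margin=1 gives the proportion of each length class within an age: for age 3, 8 of the 12 individuals fall in the 540 mm class and 4 in the 560 mm class.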

So far my basic function looks as follows:

LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){

  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # repeat each row once per measured individual
  dfexp <- sALK[rep(seq_len(nrow(sALK)), sALK$no_individuals), ]
  # counts with ages as rows and length classes as columns, then row-wise proportions
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  key
}


From the example dataset above, one can see that for year=2005 & quarter=3 & area=21 I do not have any measured individuals. Yet for the same area AND year I have data for quarters 1 and 2. The most reasonable assumption would be to take the subset from the closest time step (here quarter 2, with the same area and year) and use it to replace the NAs in the columns "no_individuals", "lenCls" and "age".
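A rough sketch of that nearest-quarter rule, assuming empty combinations are stored as a single row with NA in no_individuals (as in the table above). The helper name nearest_quarter is hypothetical, and resolving ties towards the earlier quarter is an arbitrary choice.

```r
# Rows for year 2005, area 21, copied from the example table
df <- data.frame(
  year = 2005, quarter = c(1, 1, 2, 3), area = 21,
  no_individuals = c(25, 1, 2, NA),
  lenCls = c(400, 510, 620, NA),
  age = c(2, 5, 5, NA)
)

# Hypothetical helper: nearest quarter with data in the same year and area
nearest_quarter <- function(df, Year, Quarter, Area) {
  has_data <- df$year == Year & df$area == Area & !is.na(df$no_individuals)
  q <- sort(unique(as.numeric(as.character(df$quarter[has_data]))))
  if (length(q) == 0) return(NA_real_)  # nothing to borrow within this year
  # which.min() returns the first minimum, so ties go to the earlier quarter
  q[which.min(abs(q - as.numeric(Quarter)))]
}

# Quarter 3 is empty, so look for the closest quarter that has data
donor_quarter <- nearest_quarter(df, Year = 2005, Quarter = 3, Area = 21)
donor <- subset(df, year == 2005 & quarter == donor_quarter & area == 21)
```

The returned quarter can then replace Quarter in the subset() call. An NA result signals that the whole year is empty for that area, so a different rule is needed.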

Note also that in some cases I do not have data for an entire year! In the example above, one can see this by looking at area 24 in 2007. In this case I cannot borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on.

I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.

So, any help here will be very much appreciated.

Here is the LAK function that I'm trying to update:

LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){

  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)

  # In case of an empty subset (no rows, or only NA counts)
  if(nrow(sALK)==0 || all(is.na(sALK$no_individuals))){
    warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
    # missing step: replace sALK with the subset from the nearest non-empty timestep
  }

  dfexp <- sALK[rep(seq_len(nrow(sALK)), sALK$no_individuals), ]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  key
}


Upvotes: 1

Answers (2)

So, I finally came up with a partial solution to my problem and will include my function here in case it is of interest to someone:

LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){

  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)

  # assumed check: an empty combination is a placeholder row with NA counts
  if(all(is.na(sALK$no_at_length_age))){
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))

    # count the rows per comb_index within the same year and area
    sALK2 <- subset(df, year==syear & area==sarea)
    vals <- as.data.frame(table(sALK2$comb_index))
    colnames(vals)[1] <- "comb_index"

    # combinations with more than one row are the ones that contain data
    idx <- which(vals$Freq>1)
    quarterId <- as.numeric(as.character(vals[idx,"comb_index"]))

    # %in% rather than ==, because more than one combination may qualify
    imput <- subset(df, year==syear & area==sarea & comb_index %in% quarterId)
    dfexp2 <- imput[rep(seq_len(nrow(imput)), imput$no_at_length_age), ]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key <- round(prop.table(raw2, margin=1), 3)

  } else {
    dfexp <- sALK[rep(seq_len(nrow(sALK)), sALK$no_at_length_age), ]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin=1), 3)
  }

  key
}



This solves my problem when I have data for at least one quarter of a particular Year & Area combination. However, I am still struggling with the case where I have no data at all for a particular Year & Area combination. In this case I need to borrow data from the closest Year that contains data for all the quarters for the same area. For the example above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on.
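One possible way to cover that remaining case, sketched under a relaxed assumption: instead of requiring a donor year with data for all quarters, rank every non-empty combination in the same area by distance in years first and distance in quarters second, and take the top one. With a year distance of zero this reduces to the nearest-quarter rule, so both cases go through one helper. The name find_donor is hypothetical, the count column is called no_individuals as in the question's table, and letting ties go to the earlier year or quarter is an arbitrary choice.

```r
# Rows for areas 24 and 21, copied from the example table in the question
df <- data.frame(
  year    = c(2005, 2005, 2006, 2006, 2006, 2007, 2007, 2007, 2007, 2005, 2005, 2005),
  quarter = c(1, 2, 1, 2, 4, 1, 2, 3, 4, 1, 2, 3),
  area    = c(24, 24, 24, 24, 24, 24, 24, 24, 24, 21, 21, 21),
  no_individuals = c(8, 4, 1, 8, 1, NA, NA, NA, NA, 25, 2, NA),
  lenCls  = c(380, 490, 640, 540, 660, NA, NA, NA, NA, 400, 620, NA),
  age     = c(3, 2, 5, 3, 8, NA, NA, NA, NA, 2, 5, NA)
)

# Hypothetical helper: closest non-empty (year, quarter) within the same area
find_donor <- function(df, Year, Quarter, Area) {
  Year <- as.numeric(Year)
  Quarter <- as.numeric(Quarter)
  # combinations in this area that contain measured individuals
  ok <- unique(df[df$area == Area & !is.na(df$no_individuals),
                  c("year", "quarter")])
  if (nrow(ok) == 0) stop("no measured individuals at all for area ", Area)
  # rank by year distance, then quarter distance; earlier year/quarter wins ties
  ord <- order(abs(ok$year - Year), abs(ok$quarter - Quarter),
               ok$year, ok$quarter)
  ok[ord[1], ]
}

d1 <- find_donor(df, 2007, 1, 24)  # the whole year is empty for this area
d2 <- find_donor(df, 2005, 3, 21)  # only the requested quarter is empty

# Rows to borrow for year=2007, quarter=1, area=24
imput <- subset(df, year == d1$year & quarter == d1$quarter & area == 24)
```

Inside LAK the donor rows (imput here) would then feed the same rep()/table()/prop.table() steps as a non-empty subset.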

Upvotes: 1


I don't know whether you have come across MICE (Multivariate Imputation by Chained Equations, available in R as the mice package), but it is a comprehensive tool for imputing missing values. It also lets you inspect how the imputed data are distributed, so that you can choose the method best suited to your problem. Check this brief explanation and the original package description.
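A minimal sketch of the basic workflow, assuming the mice package is installed. It uses nhanes, a small example dataset that ships with mice, rather than the data from the question, and predictive mean matching ("pmm") is just one of the available methods. Note that this models each missing value from the other columns, which is a different approach from borrowing the nearest time step.

```r
library(mice)  # assumes the mice package is installed

# nhanes is an example dataset shipped with mice; it contains missing values
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Extract the first of the five completed datasets
completed <- complete(imp, 1)

# Compare the distributions of observed and imputed values
densityplot(imp)
```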

Upvotes: 0
