DaniCee
DaniCee

Reputation: 3217

Fill NA values in a vector with last non-NA value plus the values in another vector in a rolling manner

I have an already ordered data frame that looks like the following:

mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))

  ID   Level                      Taxonomy Letter
1 A1  domain                  D__Eukaryota      D
2 A1 kingdom             K__Chloroplastida      K
3 A1  phylum                          <NA>      P
4 A1   class            C__Mamiellophyceae      C
5 A1   order                          <NA>      O
6 A1  family                          <NA>      F
7 A1   genus               G__Crustomastix      G
8 A1 species S__Crustomastix sp. MBIC10709      S

What I would like is to replace the NA values with the last non-NA value, adding the ALL the Letters "missed" at the beginning in a rolling fashion... See what I mean below.

The goal is to obtain a data frame like this:

  ID   Level                      Taxonomy Letter
1 A1  domain                  D__Eukaryota      D
2 A1 kingdom             K__Chloroplastida      K
3 A1  phylum          P__K__Chloroplastida      P
4 A1   class            C__Mamiellophyceae      C
5 A1   order         O__C__Mamiellophyceae      O
6 A1  family      F__O__C__Mamiellophyceae      F
7 A1   genus               G__Crustomastix      G
8 A1 species S__Crustomastix sp. MBIC10709      S

Notice the last 2 NAs, how the last one has to carry the value of the previous. See how the first one of the two starts with O__C and the last one with F__O__C.

So far, my best attempt is the following (thanks to Ajay Ohri):

library(zoo)
mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))
mydf <- data.frame(lapply(mydf, as.character), stringsAsFactors=FALSE)
mydf$Letter2 <- ifelse(is.na(mydf$Taxonomy),paste(mydf$Letter,'__',sep=''),"")
mydf
mydf$Taxonomy <- paste(mydf$Letter2, na.locf(mydf$Taxonomy), sep='')
mydf

Notice how I still don't manage to do it in a rolling manner (I get F__C instead of F__O__C for the last NA). Any help? Thanks!

PS: let me know if it is still confusing, so I make another MWE with more NAs in a row, so it's more obvious what I need.

Upvotes: 2

Views: 1264

Answers (5)

Diego-MX
Diego-MX

Reputation: 2349

An approach that fills NA values at the beginning with NAs, and also simplifies the logic to work with groups:

forward_fill <- function (x) {
  if (length(x) == 0) return (vector(mode(x), 0))      

  xt  <- tail(x, -1)  
  x0  <- c(x[1], xt[!is.na(xt)])
  id0 <- c(TRUE,    !is.na(xt))
  y   <- x0[cumsum(id0)]
  
  return (y)
}

Upvotes: 0

R Yoda
R Yoda

Reputation: 8770

Since you mentioned memory and performance problems you switched to the accepted data.table solution.

I am adding another data.table variant which does not depend on other packages like zoo and may be fast enough if the Taxonomy column does not contain too long sequences of NAs since the longest sequence determines the number of repetitions of the while loop (e. g. two rep. in case of the example data):

library(data.table)

mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))

setDT(mydf)

# Fill NA value in "Taxonomy" with the value of the prev. row until no NAs occur anymore
prev.number.NAs <- 0    # required to stop the loop if no more NA values can be carried forward
repeat {
  number.NAs <- sum(is.na(mydf$Taxonomy))
  if( number.NAs == 0 | number.NAs == prev.number.NAs) break;
  mydf[, filler := shift(Taxonomy), by = .(ID)]     # fill temporary working column with the value of the prev. row of the same group
  mydf[!is.na(filler) & is.na(Taxonomy), Taxonomy := paste0(Letter, "__", filler)]
  prev.number.NAs <- number.NAs
}


mydf[, filler := NULL]   # remove working column
mydf

Unfortunately the shift function of data.table does not offer a "last observation carry forward" parameter so that I had to use a while loop.

Update 1: As @UweBlock mentioned in his comment below I have replaced the while loop by a repeat loop to avoid an endless loop in case of an NA value in the column Taxonomy in the first row. THX for finding this!

Update 2: Carying forward the last observation is now only done within the same group of data (defined by the columns ID - as the OP indicated in a comment). Thx to @UweBlock for pointing out this issue!

Upvotes: 1

Uwe
Uwe

Reputation: 42592

As the OP has mentioned that memory consumption is crucial, here is a data.table approach which uses the na.locf() function from the zoo package:

library(data.table)   # CRAN version 1.10.4 used
# coerce to data.table, convert factors to characters
DT <- data.table(mydf)[, lapply(.SD, as.character)]
# set marker for NA rows 
DT[, na := is.na(Taxonomy)][]
# fill NA by Last Observation Carried Forward
DT[, Taxonomy := zoo::na.locf(Taxonomy)][]
# create list of Letters and unique row count within each group of missing taxonomies
DT[(na), `:=`(tmp = .(Letter), rn = seq_len(.N)), by = .(ID, Taxonomy)][]
# replace incomplete taxonomies
DT[(na), Taxonomy := paste(c(rev(unlist(tmp)[1:rn]), Taxonomy), collapse = "__"), 
   by = .(ID, Taxonomy, rn)][]
# clean up
DT[, c("na", "tmp", "rn") := NULL][]
   ID   Level                      Taxonomy Letter
1: A1  domain                  D__Eukaryota      D
2: A1 kingdom             K__Chloroplastida      K
3: A1  phylum          P__K__Chloroplastida      P
4: A1   class            C__Mamiellophyceae      C
5: A1   order         O__C__Mamiellophyceae      O
6: A1  family      F__O__C__Mamiellophyceae      F
7: A1   genus               G__Crustomastix      G
8: A1 species S__Crustomastix sp. MBIC10709      S

I've refrained from chaining the expressions, so the code can be executed step by step.

Note that data.table is updating in place without copying the whole data set which saves memory as well as time.

Prerequisites and additional explanations

In response to this comment, the OP has confirmed that the starting data frame is ordered and non-redundant and that ID+Level should be the unique key of the data frame.

However, as the solution above depends on these assumptions it is worthwhile to add some checks:

# (1) ID + Level are unique keys: find duplicate Levels per ID
stopifnot(anyDuplicated(DT, by = c("ID", "Level")) == 0L)
# (2) rows missing: count rows per ID, there should be 8 Levels
DT[, .N, by = ID][, stopifnot(all(N == 8L))]
# (3) order, wrong Level names, and tests (1) and (2) as well
# create data.table with Level in proper order and a sequence number ln
levels <- data.table(
  ln = 1:8,
  Level = c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
)
# left inner join, i.e., keep only rows with matching Level, keep order of DT
# then check for consecutively ascending level sequence numbers
levels[DT, on = "Level", nomatch = 0][, stopifnot(all(diff(ln) == 1L)), by = ID]

In addition, it has to be made sure that at least for the top Level "domain" the Taxonomy is specified. This can be doublechecked with:

# count number of rows with missing Taxonomy on top level "domain"
stopifnot(nrow(DT[Level == "domain" & is.na(Taxonomy)] == 0L))

The grouping logic by = .(ID, Taxonomy) is been used together with the selection on na, i.e. DT[(na), ..., in order to prepend the additional letters to Taxonomy where Taxonomywas originally missing. During development of the solution, I had introduced an additional helper column gn := rleid(ID, Taxonomy) which would cover duplicates as mentioned in this comment, Finally, I recognized that I can scrape this column because of the prerequisites.

Upvotes: 3

Sotos
Sotos

Reputation: 51592

One way to do this is to use Reduce with accumulate = TRUE argument. i.e.

ind <- is.na(mydf$Taxonomy)
mydf$Taxonomy <- zoo::na.locf(mydf$Taxonomy)
mydf$Taxonomy[ind] <- paste0(with(mydf, ave(Level, Taxonomy, FUN = function(i) 
     Reduce(paste, toupper(substr(rev(i), 1, 1)), accumulate = TRUE)))[ind], '_', 
                                                      sub('.*_', '', mydf$Taxonomy[ind]))

mydf$Taxonomy <- gsub(' ', '_', mydf$Taxonomy)


mydf
#  ID   Level                      Taxonomy Letter
#1 A1  domain                  D__Eukaryota      D
#2 A1 kingdom             K__Chloroplastida      K
#3 A1  phylum            P_K_Chloroplastida      P
#4 A1   class            C__Mamiellophyceae      C
#5 A1   order           F_O_Mamiellophyceae      O
#6 A1  family         F_O_C_Mamiellophyceae      F
#7 A1   genus               G__Crustomastix      G
#8 A1 species S__Crustomastix_sp._MBIC10709      S

Upvotes: 2

Ajay Ohri
Ajay Ohri

Reputation: 3492

Step 1

I would first create a column with an ifelse

data$colnew=ifelse(is.na(data$Taxonomy),"missed","")

if you did not intend to paste the word missed you can skip this step

Step 2 Take last value

from Replacing NAs with latest non-NA value (see other approaches here)

use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values

or new function

repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x = rev(x)           # reverse x twice if carrying backward
    ind = which(!is.na(x))             # get positions of nonmissing values
    if (is.na(x[1]) && !na.rm)         # if it begins with NA
        ind = c(1,ind)                 # add first pos
    rep_times = diff(                  # diffing the indices + length yields how often
        c(ind, length(x) + 1) )          # they need to be repeated
    if (maxgap < Inf) {
        exceed = rep_times - 1 > maxgap  # exceeding maxgap
        if (any(exceed)) {               # any exceed?
            ind = sort(c(ind[exceed] + 1, ind))      # add NA in gaps
            rep_times = diff(c(ind, length(x) + 1) ) # diff again
        }
    }
    x = rep(x[ind], times = rep_times) # repeat the values at these indices
    if (!forward) x = rev(x)           # second reversion
    x
}

function in also in formr package (Github only). https://github.com/rubenarslan/formr

Step3

Concatenate the two columns (newone) with df$Letter into a third using paste

Upvotes: 1

Related Questions