Reputation: 3217
I have an already ordered data frame that looks like the following:
mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))
ID Level Taxonomy Letter
1 A1 domain D__Eukaryota D
2 A1 kingdom K__Chloroplastida K
3 A1 phylum <NA> P
4 A1 class C__Mamiellophyceae C
5 A1 order <NA> O
6 A1 family <NA> F
7 A1 genus G__Crustomastix G
8 A1 species S__Crustomastix sp. MBIC10709 S
What I would like is to replace the NA values with the last non-NA value, adding the ALL the Letters "missed" at the beginning in a rolling fashion... See what I mean below.
The goal is to obtain a data frame like this:
ID Level Taxonomy Letter
1 A1 domain D__Eukaryota D
2 A1 kingdom K__Chloroplastida K
3 A1 phylum P__K__Chloroplastida P
4 A1 class C__Mamiellophyceae C
5 A1 order O__C__Mamiellophyceae O
6 A1 family F__O__C__Mamiellophyceae F
7 A1 genus G__Crustomastix G
8 A1 species S__Crustomastix sp. MBIC10709 S
Notice the last 2 NAs, how the last one has to carry the value of the previous. See how the first one of the two starts with O__C and the last one with F__O__C.
So far, my best attempt is the following (thanks to Ajay Ohri):
library(zoo)
mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))
mydf <- data.frame(lapply(mydf, as.character), stringsAsFactors=FALSE)
mydf$Letter2 <- ifelse(is.na(mydf$Taxonomy),paste(mydf$Letter,'__',sep=''),"")
mydf
mydf$Taxonomy <- paste(mydf$Letter2, na.locf(mydf$Taxonomy), sep='')
mydf
Notice how I still don't manage to do it in a rolling manner (I get F__C instead of F__O__C for the last NA). Any help? Thanks!
PS: let me know if it is still confusing, so I make another MWE with more NAs in a row, so it's more obvious what I need.
Upvotes: 2
Views: 1264
Reputation: 2349
An approach that fills NA
values at the beginning with NA
s, and also simplifies the logic to work with groups:
forward_fill <- function (x) {
if (length(x) == 0) return (vector(mode(x), 0))
xt <- tail(x, -1)
x0 <- c(x[1], xt[!is.na(xt)])
id0 <- c(TRUE, !is.na(xt))
y <- x0[cumsum(id0)]
return (y)
}
Upvotes: 0
Reputation: 8770
Since you mentioned memory and performance problems you switched to the accepted data.table
solution.
I am adding another data.table
variant which does not depend on other packages like zoo
and may be fast enough if the Taxonomy column does not contain too long sequences of NA
s since the longest sequence determines the number of repetitions of the while loop (e. g. two rep. in case of the example data):
library(data.table)
mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))
setDT(mydf)
# Fill NA value in "Taxonomy" with the value of the prev. row until no NAs occur anymore
prev.number.NAs <- 0 # required to stop the loop if no more NA values can be carried forward
repeat {
number.NAs <- sum(is.na(mydf$Taxonomy))
if( number.NAs == 0 | number.NAs == prev.number.NAs) break;
mydf[, filler := shift(Taxonomy), by = .(ID)] # fill temporary working column with the value of the prev. row of the same group
mydf[!is.na(filler) & is.na(Taxonomy), Taxonomy := paste0(Letter, "__", filler)]
prev.number.NAs <- number.NAs
}
mydf[, filler := NULL] # remove working column
mydf
Unfortunately the shift
function of data.table
does not offer a "last observation carry forward" parameter so that I had to use a while
loop.
Update 1: As @UweBlock mentioned in his comment below I have replaced the while
loop by a repeat
loop to avoid an endless loop in case of an NA
value in the column Taxonomy in the first row. THX for finding this!
Update 2: Carying forward the last observation is now only done within the same group of data (defined by the columns ID - as the OP indicated in a comment). Thx to @UweBlock for pointing out this issue!
Upvotes: 1
Reputation: 42592
As the OP has mentioned that memory consumption is crucial, here is a data.table
approach which uses the na.locf()
function from the zoo
package:
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table, convert factors to characters
DT <- data.table(mydf)[, lapply(.SD, as.character)]
# set marker for NA rows
DT[, na := is.na(Taxonomy)][]
# fill NA by Last Observation Carried Forward
DT[, Taxonomy := zoo::na.locf(Taxonomy)][]
# create list of Letters and unique row count within each group of missing taxonomies
DT[(na), `:=`(tmp = .(Letter), rn = seq_len(.N)), by = .(ID, Taxonomy)][]
# replace incomplete taxonomies
DT[(na), Taxonomy := paste(c(rev(unlist(tmp)[1:rn]), Taxonomy), collapse = "__"),
by = .(ID, Taxonomy, rn)][]
# clean up
DT[, c("na", "tmp", "rn") := NULL][]
ID Level Taxonomy Letter 1: A1 domain D__Eukaryota D 2: A1 kingdom K__Chloroplastida K 3: A1 phylum P__K__Chloroplastida P 4: A1 class C__Mamiellophyceae C 5: A1 order O__C__Mamiellophyceae O 6: A1 family F__O__C__Mamiellophyceae F 7: A1 genus G__Crustomastix G 8: A1 species S__Crustomastix sp. MBIC10709 S
I've refrained from chaining the expressions, so the code can be executed step by step.
Note that data.table
is updating in place without copying the whole data set which saves memory as well as time.
In response to this comment, the OP has confirmed that the starting data frame is ordered and non-redundant and that ID+Level should be the unique key of the data frame.
However, as the solution above depends on these assumptions it is worthwhile to add some checks:
# (1) ID + Level are unique keys: find duplicate Levels per ID
stopifnot(anyDuplicated(DT, by = c("ID", "Level")) == 0L)
# (2) rows missing: count rows per ID, there should be 8 Levels
DT[, .N, by = ID][, stopifnot(all(N == 8L))]
# (3) order, wrong Level names, and tests (1) and (2) as well
# create data.table with Level in proper order and a sequence number ln
levels <- data.table(
ln = 1:8,
Level = c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
)
# left inner join, i.e., keep only rows with matching Level, keep order of DT
# then check for consecutively ascending level sequence numbers
levels[DT, on = "Level", nomatch = 0][, stopifnot(all(diff(ln) == 1L)), by = ID]
In addition, it has to be made sure that at least for the top Level
"domain" the Taxonomy
is specified. This can be doublechecked with:
# count number of rows with missing Taxonomy on top level "domain"
stopifnot(nrow(DT[Level == "domain" & is.na(Taxonomy)] == 0L))
The grouping logic by = .(ID, Taxonomy)
is been used together with the selection on na
, i.e. DT[(na), ...
, in order to prepend the additional letters to Taxonomy
where Taxonomy
was originally missing. During development of the solution, I had introduced an additional helper column gn := rleid(ID, Taxonomy)
which would cover duplicates as mentioned in this comment, Finally, I recognized that I can scrape this column because of the prerequisites.
Upvotes: 3
Reputation: 51592
One way to do this is to use Reduce
with accumulate = TRUE
argument. i.e.
ind <- is.na(mydf$Taxonomy)
mydf$Taxonomy <- zoo::na.locf(mydf$Taxonomy)
mydf$Taxonomy[ind] <- paste0(with(mydf, ave(Level, Taxonomy, FUN = function(i)
Reduce(paste, toupper(substr(rev(i), 1, 1)), accumulate = TRUE)))[ind], '_',
sub('.*_', '', mydf$Taxonomy[ind]))
mydf$Taxonomy <- gsub(' ', '_', mydf$Taxonomy)
mydf
# ID Level Taxonomy Letter
#1 A1 domain D__Eukaryota D
#2 A1 kingdom K__Chloroplastida K
#3 A1 phylum P_K_Chloroplastida P
#4 A1 class C__Mamiellophyceae C
#5 A1 order F_O_Mamiellophyceae O
#6 A1 family F_O_C_Mamiellophyceae F
#7 A1 genus G__Crustomastix G
#8 A1 species S__Crustomastix_sp._MBIC10709 S
Upvotes: 2
Reputation: 3492
Step 1
I would first create a column with an ifelse
data$colnew=ifelse(is.na(data$Taxonomy),"missed","")
if you did not intend to paste the word missed you can skip this step
Step 2 Take last value
from Replacing NAs with latest non-NA value (see other approaches here)
use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values
or new function
repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
if (!forward) x = rev(x) # reverse x twice if carrying backward
ind = which(!is.na(x)) # get positions of nonmissing values
if (is.na(x[1]) && !na.rm) # if it begins with NA
ind = c(1,ind) # add first pos
rep_times = diff( # diffing the indices + length yields how often
c(ind, length(x) + 1) ) # they need to be repeated
if (maxgap < Inf) {
exceed = rep_times - 1 > maxgap # exceeding maxgap
if (any(exceed)) { # any exceed?
ind = sort(c(ind[exceed] + 1, ind)) # add NA in gaps
rep_times = diff(c(ind, length(x) + 1) ) # diff again
}
}
x = rep(x[ind], times = rep_times) # repeat the values at these indices
if (!forward) x = rev(x) # second reversion
x
}
function in also in formr package (Github only). https://github.com/rubenarslan/formr
Step3
Concatenate the two columns (newone) with df$Letter into a third using paste
Upvotes: 1