Mee
Mee

Reputation: 161

How to mark rows with NAs which were replaced by na.approx()?

I have a dataset with multiple columns and more than 60k rows. I am using na.approx() to replace NAs in each column. I would like to identify the rows which had NA and were replaced, something like "Dataset - After removing NAs". I would appreciate if you could please help me with that.

Dataset - Before removing NAs
A    B    C
1.0  2.2  3.3
NA   5.7  8.8
6.2  7.2  NA
3.0  NA   1.1

Dataset - After removing NAs
A    A_NA   B     B_NA    C    C_NA
1.0  No     2.2   No      3.3  No
3.6  Yes    5.7   No      8.8  No
6.2  No     7.2   No      4.6  Yes
3.0  No     7.2   Yes     1.1  No 

Sorry, I cannot share the original dataset, but if it is necessary I can prepare a better example.

Thanks

UPDATE

Sorry, I forgot to mention that in my original dataset, I have some columns in my dataset which are not numeric, also my dataset is a data.table.

Dataset - Before removing NAs
ID        Trial     Weight    Height    Depth     
1         Wheat     1.0       2.2       3.3
1         Wheat     NA        5.7       8.8
2         Maize     6.2       7.2       NA
4         Maize     3.0       NA        1.1

Upvotes: 4

Views: 137

Answers (6)

G. Grothendieck
G. Grothendieck

Reputation: 269526

1) Use na.approx and then bind it to the Yes/No data frame. Note that the pipe to setNames is within the second argument to cbind whereas the final pipe is taking the result of cbind.

library(zoo)

cbind(na.approx(DF), as.data.frame(ifelse(is.na(DF), "Yes", "No")) |>
             setNames(paste0(names(DF), "_NA")) ) |>
  ( \(x) x[order(names(x))] )()
##     A A_NA   B B_NA    C C_NA
## 1 1.0   No 2.2   No 3.30   No
## 2 3.6  Yes 5.7   No 8.80   No
## 3 6.2   No 7.2   No 4.95  Yes
## 4 3.0   No  NA  Yes 1.10   No

2) Translating that to tidyverse we have the following. This has the advantage that it goes strictly left to right and only explicitly references DF once at the beginning so that if it changes to some other name it only has to be replaced in one spot.

library(dplyr, exclude = c("filter", "lag"))
library(rlang)
library(zoo)

DF %>%
  { bind_cols(na.approx(.), data.frame(ifelse(is.na(.), "Yes", "No")) %>%
       set_names(~ paste0(.x, "_NA")))

  } %>%
  select(sort(names(.)))

##     A A_NA   B B_NA    C C_NA
## 1 1.0   No 2.2   No 3.30   No
## 2 3.6  Yes 5.7   No 8.80   No
## 3 6.2   No 7.2   No 4.95  Yes
## 4 3.0   No  NA  Yes 1.10   No

Note

DF <- data.frame(
  A = c(1, NA, 6.2, 3),
  B = c(2.2, 5.7, 7.2, NA),
  C = c(3.3, 8.8, NA, 1.1)
)

Upvotes: 2

jpsmith
jpsmith

Reputation: 17195

You could do this in base R by indexing:

#new columns if it is NA or not
df[paste0(names(df),"_NA")] <- is.na(df)

# use zoo::na.approx
df[-grep("_NA", names(df))] <- zoo::na.approx(df[-grep("_NA", names(df))])

# reorder (if desired)
df <- df[,sort(names(df))]

Output:

    A  A_NA   B  B_NA    C  C_NA
1 1.0 FALSE 2.2 FALSE 3.30 FALSE
2 3.6  TRUE 5.7 FALSE 8.80 FALSE
3 6.2 FALSE 7.2 FALSE 4.95  TRUE
4 3.0 FALSE  NA  TRUE 1.10 FALSE

If you wanted "Yes" or "No" instead of T/F, you could just replace is.na(df) with ifelse(is.na(df), "Yes", "No")

Data:

df <- read.table(text = "A    B    C
1.0  2.2  3.3
NA   5.7  8.8
6.2  7.2  NA
3.0  NA   1.1", header = TRUE)

EDIT Using data table, you can tweak the code by:

library(data.table)

# Identify numeric columns
idxcols <- sapply(dt, is.numeric)

dt[, paste0(names(idxcols)[idxcols], "_NA") := lapply(.SD, is.na), .SDcols = idxcols]

Data table data:

dt <- data.table::data.table(read.table(text = "ID        Trial     Weight    Height    Depth     
1         Wheat     1.0       2.2       3.3
1         Wheat     NA        5.7       8.8
2         Maize     6.2       7.2       NA
4         Maize     3.0       NA        1.1", header = TRUE))

Upvotes: 4

Ronak Shah
Ronak Shah

Reputation: 388962

I'll add a dplyr flavor to already existing answers.

library(dplyr)

df %>%
  transmute(across(everything(), is.na, .names = "{col}_NA")) %>%
  bind_cols(zoo::na.approx(df, rule = 2)) %>%
  select(order(names(.)))

#    A  A_NA   B  B_NA    C  C_NA
#1 1.0 FALSE 2.2 FALSE 3.30 FALSE
#2 3.6  TRUE 5.7 FALSE 8.80 FALSE
#3 6.2 FALSE 7.2 FALSE 4.95  TRUE
#4 3.0 FALSE 7.2  TRUE 1.10 FALSE

Upvotes: 2

ThomasIsCoding
ThomasIsCoding

Reputation: 101257

Probably you should specify rule=2 when applying na.approx

cbind(
  zoo::na.approx(df, rule = 2),
  as.data.frame(
    ifelse(is.na(df), "Yes", "No"),
    col.names = paste0(names(df), "_NA")
  )
)[order(rep(seq_along(df), 2))]

which gives

    A A_NA   B B_NA    C C_NA
1 1.0   No 2.2   No 3.30   No
2 3.6  Yes 5.7   No 8.80   No
3 6.2   No 7.2   No 4.95  Yes
4 3.0   No 7.2  Yes 1.10   No

data

> dput(df)
structure(list(A = c(1, NA, 6.2, 3), B = c(2.2, 5.7, 7.2, NA),
    C = c(3.3, 8.8, NA, 1.1)), class = "data.frame", row.names = c(NA, 
-4L))

Upvotes: 3

zerweck
zerweck

Reputation: 726

Alternative answer using only base R:

You can just create the *_NA columns before running na.approx. You can e.g. loop over all columns with is.na and assign the result as a new column. The ordering of the columns in the way you show in your example could be done by sorting the columns alphabetically afterwards.


df <- data.frame(
  A = c(1.0, NA, 6.2, 3.0),
  B = c(2.2, 5.7, 7.2, NA),
  C = c(3.3, 8.8, NA, 1.1)
)
df_isna = lapply(df, is.na)
df_isna = lapply(df_isna, factor, labels = c("No","Yes"))
names(df_isna) = paste0(names(df_isna),"_NA")
df_new = as.data.frame(zoo::na.approx(df, rule = 2))
df_new = cbind(df_new, df_isna)
df_new = df_new[,sort(names(df_new))]

Upvotes: 2

Rui Barradas
Rui Barradas

Reputation: 76402

Write a function to take care of getting the NA's before interpolation and of putting the two tables together.
The code comments should explain how the function does it.

Dataset <- read.table(text = "
A    B    C
1.0  2.2  3.3
NA   5.7  8.8
6.2  7.2  NA
3.0  NA   1.1
", header = TRUE)


fun <- function(x) {
  # auxiliary variable
  cn <- colnames(x)
  # get the NA positions as a logical matrix
  na <- sapply(x, is.na)
  na_char <- matrix("", nrow = nrow(na), ncol = ncol(na))
  # assign the values "No" and "Yes"
  na_char[na] <- "Yes"
  na_char[!na] <- "No"
  # interpolate
  y <- sapply(x, zoo::na.approx, na.rm = FALSE) |> as.data.frame()
  # put the old and new columns in wanted order
  colnames(na_char) <- paste(colnames(na), "NA", sep = "_")
  i <- order(c(cn, colnames(na_char)))
  new_names <- c(cn, colnames(na_char))[i]
  # assign the new names and exit
  setNames(cbind(y, na_char)[, i], new_names)
}

fun(Dataset)
#>     A A_NA   B B_NA    C C_NA
#> 1 1.0   No 2.2   No 3.30   No
#> 2 3.6  Yes 5.7   No 8.80   No
#> 3 6.2   No 7.2   No 4.95  Yes
#> 4 3.0   No  NA  Yes 1.10   No

Created on 2025-01-08 with reprex v2.1.1

Upvotes: 2

Related Questions