Reputation: 161
I have a dataset with multiple columns and more than 60k rows. I am using na.approx()
to replace NAs in each column. I would like to identify the rows which had NA and were replaced, something like "Dataset - After removing NAs". I would appreciate if you could please help me with that.
Dataset - Before removing NAs
A B C
1.0 2.2 3.3
NA 5.7 8.8
6.2 7.2 NA
3.0 NA 1.1
Dataset - After removing NAs
A A_NA B B_NA C C_NA
1.0 No 2.2 No 3.3 No
3.6 Yes 5.7 No 8.8 No
6.2 No 7.2 No 4.6 Yes
3.0 No 7.2 Yes 1.1 No
Sorry, I cannot share the original dataset, but if it is necessary I can prepare a better example.
Thanks
UPDATE
Sorry, I forgot to mention that in my original dataset, I have some columns in my dataset which are not numeric, also my dataset is a data.table.
Dataset - Before removing NAs
ID Trial Weight Height Depth
1 Wheat 1.0 2.2 3.3
1 Wheat NA 5.7 8.8
2 Maize 6.2 7.2 NA
4 Maize 3.0 NA 1.1
Upvotes: 4
Views: 137
Reputation: 269526
1) Use na.approx
and then bind it to the Yes/No data frame. Note that the pipe to setNames
is within the second argument to cbind
whereas the final pipe is taking the result of cbind
.
library(zoo)
cbind(na.approx(DF), as.data.frame(ifelse(is.na(DF), "Yes", "No")) |>
setNames(paste0(names(DF), "_NA")) ) |>
( \(x) x[order(names(x))] )()
## A A_NA B B_NA C C_NA
## 1 1.0 No 2.2 No 3.30 No
## 2 3.6 Yes 5.7 No 8.80 No
## 3 6.2 No 7.2 No 4.95 Yes
## 4 3.0 No NA Yes 1.10 No
2) Translating that to tidyverse we have the following. This has the advantage that it goes strictly left to right and only explicitly references DF
once at the beginning so that if it changes to some other name it only has to be replaced in one spot.
library(dplyr, exclude = c("filter", "lag"))
library(rlang)
library(zoo)
DF %>%
{ bind_cols(na.approx(.), data.frame(ifelse(is.na(.), "Yes", "No")) %>%
set_names(~ paste0(.x, "_NA")))
} %>%
select(sort(names(.)))
## A A_NA B B_NA C C_NA
## 1 1.0 No 2.2 No 3.30 No
## 2 3.6 Yes 5.7 No 8.80 No
## 3 6.2 No 7.2 No 4.95 Yes
## 4 3.0 No NA Yes 1.10 No
DF <- data.frame(
A = c(1, NA, 6.2, 3),
B = c(2.2, 5.7, 7.2, NA),
C = c(3.3, 8.8, NA, 1.1)
)
Upvotes: 2
Reputation: 17195
You could do this in base R by indexing:
#new columns if it is NA or not
df[paste0(names(df),"_NA")] <- is.na(df)
# use zoo::na.approx
df[-grep("_NA", names(df))] <- zoo::na.approx(df[-grep("_NA", names(df))])
# reorder (if desired)
df <- df[,sort(names(df))]
Output:
A A_NA B B_NA C C_NA
1 1.0 FALSE 2.2 FALSE 3.30 FALSE
2 3.6 TRUE 5.7 FALSE 8.80 FALSE
3 6.2 FALSE 7.2 FALSE 4.95 TRUE
4 3.0 FALSE NA TRUE 1.10 FALSE
If you wanted "Yes" or "No" instead of T/F, you could just replace is.na(df)
with ifelse(is.na(df), "Yes", "No")
Data:
df <- read.table(text = "A B C
1.0 2.2 3.3
NA 5.7 8.8
6.2 7.2 NA
3.0 NA 1.1", header = TRUE)
EDIT Using data table, you can tweak the code by:
library(data.table)
# Identify numeric columns
idxcols <- sapply(dt, is.numeric)
dt[, paste0(names(idxcols)[idxcols], "_NA") := lapply(.SD, is.na), .SDcols = idxcols]
Data table data:
dt <- data.table::data.table(read.table(text = "ID Trial Weight Height Depth
1 Wheat 1.0 2.2 3.3
1 Wheat NA 5.7 8.8
2 Maize 6.2 7.2 NA
4 Maize 3.0 NA 1.1", header = TRUE))
Upvotes: 4
Reputation: 388962
I'll add a dplyr
flavor to already existing answers.
library(dplyr)
df %>%
transmute(across(everything(), is.na, .names = "{col}_NA")) %>%
bind_cols(zoo::na.approx(df, rule = 2)) %>%
select(order(names(.)))
# A A_NA B B_NA C C_NA
#1 1.0 FALSE 2.2 FALSE 3.30 FALSE
#2 3.6 TRUE 5.7 FALSE 8.80 FALSE
#3 6.2 FALSE 7.2 FALSE 4.95 TRUE
#4 3.0 FALSE 7.2 TRUE 1.10 FALSE
Upvotes: 2
Reputation: 101257
Probably you should specify rule=2
when applying na.approx
cbind(
zoo::na.approx(df, rule = 2),
as.data.frame(
ifelse(is.na(df), "Yes", "No"),
col.names = paste0(names(df), "_NA")
)
)[order(rep(seq_along(df), 2))]
which gives
A A_NA B B_NA C C_NA
1 1.0 No 2.2 No 3.30 No
2 3.6 Yes 5.7 No 8.80 No
3 6.2 No 7.2 No 4.95 Yes
4 3.0 No 7.2 Yes 1.10 No
> dput(df)
structure(list(A = c(1, NA, 6.2, 3), B = c(2.2, 5.7, 7.2, NA),
C = c(3.3, 8.8, NA, 1.1)), class = "data.frame", row.names = c(NA,
-4L))
Upvotes: 3
Reputation: 726
Alternative answer using only base R:
You can just create the *_NA columns before running na.approx. You can e.g. loop over all columns with is.na and assign the result as a new column. The ordering of the columns in the way you show in your example could be done by sorting the columns alphabetically afterwards.
df <- data.frame(
A = c(1.0, NA, 6.2, 3.0),
B = c(2.2, 5.7, 7.2, NA),
C = c(3.3, 8.8, NA, 1.1)
)
df_isna = lapply(df, is.na)
df_isna = lapply(df_isna, factor, labels = c("No","Yes"))
names(df_isna) = paste0(names(df_isna),"_NA")
df_new = as.data.frame(zoo::na.approx(df, rule = 2))
df_new = cbind(df_new, df_isna)
df_new = df_new[,sort(names(df_new))]
Upvotes: 2
Reputation: 76402
Write a function to take care of getting the NA
's before interpolation and of putting the two tables together.
The code comments should explain how the function does it.
Dataset <- read.table(text = "
A B C
1.0 2.2 3.3
NA 5.7 8.8
6.2 7.2 NA
3.0 NA 1.1
", header = TRUE)
fun <- function(x) {
# auxiliary variable
cn <- colnames(x)
# get the NA positions as a logical matrix
na <- sapply(x, is.na)
na_char <- matrix("", nrow = nrow(na), ncol = ncol(na))
# assign the values "No" and "Yes"
na_char[na] <- "Yes"
na_char[!na] <- "No"
# interpolate
y <- sapply(x, zoo::na.approx, na.rm = FALSE) |> as.data.frame()
# put the old and new columns in wanted order
colnames(na_char) <- paste(colnames(na), "NA", sep = "_")
i <- order(c(cn, colnames(na_char)))
new_names <- c(cn, colnames(na_char))[i]
# assign the new names and exit
setNames(cbind(y, na_char)[, i], new_names)
}
fun(Dataset)
#> A A_NA B B_NA C C_NA
#> 1 1.0 No 2.2 No 3.30 No
#> 2 3.6 Yes 5.7 No 8.80 No
#> 3 6.2 No 7.2 No 4.95 Yes
#> 4 3.0 No NA Yes 1.10 No
Created on 2025-01-08 with reprex v2.1.1
Upvotes: 2