bilmawr
bilmawr

Reputation: 51

Extract first Non NA value over multiple columns

I'm still learning R and was wondering if I there was an elegant way of manipulating the below df to achieve df2.

I'm not sure if it's a loop that is supposed to be used for this, but basically I want to extract the first Non NA "X_No" Value if the "X_No" value is NA in the first row. This would perhaps be best described through an example from df to the desired df2.

A_ID <- c('A','B','I','N')
A_No <- c(11,NA,15,NA)
B_ID <- c('B','C','D','J')
B_No <- c(NA,NA,12,NA)
C_ID <- c('E','F','G','P')
C_No <- c(NA,13,14,20)
D_ID <- c('J','K','L','M')
D_No <- c(NA,NA,NA,40)
E_ID <- c('W','X','Y','Z')
E_No <- c(50,32,48,40)
df <- data.frame(A_ID,A_No,B_ID,B_No,C_ID,C_No,D_ID,D_No,E_ID,E_No)

ID <- c('A','D','F','M','W')
No <- c(11,12,13,40,50)
df2 <- data.frame(ID,No)


I'm hoping for an elegant solution to this as there are over a 1000 columns similar to the example provided. I've looked all over the web for a similar example however to no avail that would reproduce the expected result.

Your help is very much appreciated. Thankyou

Upvotes: 3

Views: 757

Answers (3)

Wimpel
Wimpel

Reputation: 27732

Here is a data.table solution that should scale well to a (very) large dataset.

functionally

  1. split the data.frame to a list of chunks of columns, based on their names. So all columns startting with A_ go to the first element, all colums startting with B_ to the second

  2. Then, put these list elements on top of each other, using data.table::rbindlist. Ignure the column-namaes (this only works if A_ has the same number of columns as B_ has the same number of cols as n_)

  3. Now get the first non-NA value of each value in the first column

code

library(data.table)
# split based on what comes after the underscore
L <- split.default(df, f = gsub("(.*)_.*", "\\1", names(df)))
# bind together again
DT <- rbindlist(L, use.names = FALSE)
# extract the first value of the non-NA
DT[!is.na(A_No), .(No = A_No[1]), keyby = .(ID = A_ID)]
#    ID No
# 1:  A 11
# 2:  D 12
# 3:  F 13
# 4:  G 14
# 5:  I 15
# 6:  M 40
# 7:  P 20
# 8:  W 50
# 9:  X 32
#10:  Y 48
#11:  Z 40

Upvotes: 1

jared_mamrot
jared_mamrot

Reputation: 26484

I don't know if I'd call it "elegant", but here is a potential solution:

library(tidyverse)

A_ID <- c('A','B','I','N')
A_No <- c(11,NA,15,NA)
B_ID <- c('B','C','D','J')
B_No <- c(NA,NA,12,NA)
C_ID <- c('E','F','G','P')
C_No <- c(NA,13,14,20)
D_ID <- c('J','K','L','M')
D_No <- c(NA,NA,NA,40)
E_ID <- c('W','X','Y','Z')
E_No <- c(50,32,48,40)
df <- data.frame(A_ID,A_No,B_ID,B_No,C_ID,C_No,D_ID,D_No,E_ID,E_No)

ID <- c('A','D','F','M','W')
No <- c(11,12,13,40,50)
df2 <- data.frame(ID,No)

output <- df %>%
  pivot_longer(everything(),
               names_sep = "_",
               names_to = c("Col", ".value")) %>%
  drop_na() %>%
  group_by(Col) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  select(-Col)

df2
#>   ID No
#> 1  A 11
#> 2  D 12
#> 3  F 13
#> 4  M 40
#> 5  W 50

output
#> # A tibble: 5 × 2
#>   ID       No
#>   <chr> <dbl>
#> 1 A        11
#> 2 D        12
#> 3 F        13
#> 4 M        40
#> 5 W        50

all_equal(df2, output)
#> [1] TRUE

Created on 2023-02-08 with reprex v2.0.2

Upvotes: 3

akrun
akrun

Reputation: 886948

Using base R with max.col (assuming the columns are alternating with ID, No)

ind <- max.col(!is.na(t(df[c(FALSE, TRUE)])), "first")
m1 <- cbind(seq_along(ind), ind)
data.frame(ID = t(df[c(TRUE, FALSE)])[m1], No =  t(df[c(FALSE, TRUE)])[m1])
  ID No
1  A 11
2  D 12
3  F 13
4  M 40
5  W 50

Upvotes: 2

Related Questions