McMahok
McMahok

Reputation: 360

Replacing missing values by searching dataframe for those values by matching

I have a dataframe with ~ 500,000 observations, some of the cols have blanks, e.g. a missing name. However the names correspond with an ID, sometimes the name or ID are both there, but sometimes one or the other is missing. Is there a way to search through the df, find and match the missing data

So in this simple example

structure(list(Names = c("Jon", "Al", "Phil", "Sarah", "Jon", 
" "), ID = c(12, 14, 16, 18, 12, 14)), class = "data.frame", row.names = c(NA, 
-6L))


df

  Names ID
1   Jon 12
2    Al 14
3  Phil 16
4 Sarah 18
5   Jon 12
6       14

I know that 14 matches with Al So I could replace the blank on row 6 with Al. But how can I do this on a larger scale with multiple missing values?

Upvotes: 1

Views: 26

Answers (1)

akrun
akrun

Reputation: 886938

Assuming that ID have only a single unique 'Names' - do a group by 'ID' and replace the 'Names' that are blank ("") with the first 'Names'

library(dplyr)
df %>%
   mutate(Names = trimws(Names)) %>%
   group_by(ID) %>% 
   mutate(Names = replace(Names, Names == "", 
           first(Names[Names != ""]))) %>% 
   ungroup

-output

# A tibble: 6 × 2
  Names    ID
  <chr> <dbl>
1 Jon      12
2 Al       14
3 Phil     16
4 Sarah    18
5 Jon      12
6 Al       14

Another option is to replace the "" to NA and then use fill to replace the missing values by 'ID' with the non-NA adjacent Names

library(tidyr)
df %>% 
   mutate(Names = na_if(trimws(Names), "")) %>% 
   group_by(ID) %>% 
   fill(Names, .direction = 'downup') %>%
   ungroup
# A tibble: 6 × 2
  Names    ID
  <chr> <dbl>
1 Jon      12
2 Al       14
3 Phil     16
4 Sarah    18
5 Jon      12
6 Al       14

Upvotes: 1

Related Questions