Reputation: 360
I have a dataframe with ~ 500,000 observations, some of the cols have blanks, e.g. a missing name. However the names correspond with an ID, sometimes the name or ID are both there, but sometimes one or the other is missing. Is there a way to search through the df, find and match the missing data
So in this simple example
structure(list(Names = c("Jon", "Al", "Phil", "Sarah", "Jon",
" "), ID = c(12, 14, 16, 18, 12, 14)), class = "data.frame", row.names = c(NA,
-6L))
df
Names ID
1 Jon 12
2 Al 14
3 Phil 16
4 Sarah 18
5 Jon 12
6 14
I know that 14 matches with Al
So I could replace the blank on row 6 with Al.
But how can I do this on a larger scale with multiple missing values?
Upvotes: 1
Views: 26
Reputation: 886938
Assuming that ID
have only a single unique 'Names' - do a group by 'ID' and replace the 'Names' that are blank (""
) with the first 'Names'
library(dplyr)
df %>%
mutate(Names = trimws(Names)) %>%
group_by(ID) %>%
mutate(Names = replace(Names, Names == "",
first(Names[Names != ""]))) %>%
ungroup
-output
# A tibble: 6 × 2
Names ID
<chr> <dbl>
1 Jon 12
2 Al 14
3 Phil 16
4 Sarah 18
5 Jon 12
6 Al 14
Another option is to replace the ""
to NA
and then use fill
to replace the missing values by 'ID' with the non-NA adjacent Names
library(tidyr)
df %>%
mutate(Names = na_if(trimws(Names), "")) %>%
group_by(ID) %>%
fill(Names, .direction = 'downup') %>%
ungroup
# A tibble: 6 × 2
Names ID
<chr> <dbl>
1 Jon 12
2 Al 14
3 Phil 16
4 Sarah 18
5 Jon 12
6 Al 14
Upvotes: 1